
[ovs-dev,v2] netdev-dpdk: Refactor the DPDK transmit path.

Message ID 20210110030505.325722-1-fbl@sysclose.org
State Superseded
Series [ovs-dev,v2] netdev-dpdk: Refactor the DPDK transmit path.

Commit Message

Flavio Leitner Jan. 10, 2021, 3:05 a.m. UTC
This patch splits out the common code between the vhost and
dpdk transmit paths into shared functions to simplify the
code and fix an issue.

The issue is that a packet coming from a non-DPDK device
and egressing on a DPDK device currently skips the hwol
preparation.

This also has the side effect of leaving only the dpdk
transmit code under the txq lock.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---

V2:
   - mentioned the tx lock change in the commit message.
   - fixed packet leak when copy fails.
   - moved pkt_cnt = cnt two lines up.

I tested the following scenarios with iperf and iperf -R
and noticed no performance regression:
          IPERF       VM-Bridge  1.02x
          IPERF           VM-VM  1.04x
          IPERF      VM-ExtHost  1.00x
          IPERF           VM-NS  1.01x
          IPERF  VM-VLAN-Bridge  1.03x
          IPERF      VM-VLAN-VM  1.03x
          IPERF VM-VLAN-ExtHost  1.01x
          IPERF      VM-VLAN-NS  1.02x
          IPERF        VM-V6-VM  1.03x
          IPERF   VM-V6-ExtHost  1.00x
          IPERF        VM-V6-NS  1.01x
        IPERF-R       VM-Bridge  1.01x
        IPERF-R           VM-VM  1.04x
        IPERF-R      VM-ExtHost  1.10x
        IPERF-R           VM-NS  1.01x
        IPERF-R  VM-VLAN-Bridge  1.03x
        IPERF-R      VM-VLAN-VM  1.02x
        IPERF-R VM-VLAN-ExtHost  1.08x
        IPERF-R      VM-VLAN-NS  1.02x
        IPERF-R        VM-V6-VM  1.00x
        IPERF-R   VM-V6-ExtHost  1.11x
        IPERF-R        VM-V6-NS  1.00x

Now using trex, 64byte packet, PVP:
Original: 3.6Mpps
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  avg cycles per packet: 304.92 (8331383020/27323150)
  avg processing cycles per packet: 304.92 (8331383020/27323150)

Patched: 3.6Mpps
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  avg cycles per packet: 304.08 (21875784116/71941516)
  avg processing cycles per packet: 304.08 (21875784116/71941516)

 lib/netdev-dpdk.c | 335 +++++++++++++++++++---------------------------
 1 file changed, 139 insertions(+), 196 deletions(-)
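
The refactor funnels both transmit paths through one shared helper that
copies non-DPDK packets to mbufs, drops oversized packets, prepares hw
offloads, and applies QoS before the device-specific burst/enqueue. As a
toy sketch of the drop-accounting pattern that helper follows (each stage
returns the survivor count and the difference is charged to a per-reason
counter), using made-up stand-in stages rather than the OVS/DPDK APIs
from the actual diff:

#include <stddef.h>
#include <stdio.h>

/* Toy illustration, not OVS code: every stage returns how many packets
 * survived and the difference is charged to a per-reason counter. The
 * two stage functions are hypothetical stand-ins. */
struct toy_sw_stats {
    size_t mtu_exceeded_drops;
    size_t qos_drops;
};

static size_t toy_filter_len(size_t cnt) { return cnt >= 2 ? cnt - 2 : 0; }
static size_t toy_qos_run(size_t cnt)    { return cnt >= 1 ? cnt - 1 : 0; }

static size_t
toy_common_send(size_t pkt_cnt, struct toy_sw_stats *stats)
{
    size_t cnt;

    cnt = toy_filter_len(pkt_cnt);          /* drop oversized packets */
    stats->mtu_exceeded_drops += pkt_cnt - cnt;
    pkt_cnt = cnt;

    cnt = toy_qos_run(pkt_cnt);             /* apply QoS policy */
    stats->qos_drops += pkt_cnt - cnt;

    return cnt;                             /* packets left to transmit */
}

int main(void)
{
    struct toy_sw_stats stats = { 0, 0 };
    size_t sent = toy_common_send(32, &stats);

    printf("sent=%zu mtu_drops=%zu qos_drops=%zu\n",
           sent, stats.mtu_exceeded_drops, stats.qos_drops);
    return 0;
}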

Comments

Pai G, Sunil Jan. 11, 2021, 3:49 p.m. UTC | #1
Hi Flavio,


> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> Sent: Sunday, January 10, 2021 8:35 AM
> To: dev@openvswitch.org
> Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner
> <fbl@sysclose.org>
> Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit
> path.
> 
> This patch split out the common code between vhost and dpdk transmit
> paths to shared functions to simplify the code and fix an issue.
> 
> The issue is that the packet coming from non-DPDK device and egressing on a
> DPDK device currently skips the hwol preparation.
> 
> This also have the side effect of leaving only the dpdk transmit code under
> the txq lock.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
> 
> V2:
>    - mentioned the tx lock change in the commit message.
>    - fixed packet leak when copy fails.
>    - moved pkt_cnt = cnt two lines up.
> 
> I tested the following scenarios with iperf and iperf -R and noticed no
> performance regression:
>           IPERF       VM-Bridge  1.02x
>           IPERF           VM-VM  1.04x
>           IPERF      VM-ExtHost  1.00x
>           IPERF           VM-NS  1.01x
>           IPERF  VM-VLAN-Bridge  1.03x
>           IPERF      VM-VLAN-VM  1.03x
>           IPERF VM-VLAN-ExtHost  1.01x
>           IPERF      VM-VLAN-NS  1.02x
>           IPERF        VM-V6-VM  1.03x
>           IPERF   VM-V6-ExtHost  1.00x
>           IPERF        VM-V6-NS  1.01x
>         IPERF-R       VM-Bridge  1.01x
>         IPERF-R           VM-VM  1.04x
>         IPERF-R      VM-ExtHost  1.10x
>         IPERF-R           VM-NS  1.01x
>         IPERF-R  VM-VLAN-Bridge  1.03x
>         IPERF-R      VM-VLAN-VM  1.02x
>         IPERF-R VM-VLAN-ExtHost  1.08x
>         IPERF-R      VM-VLAN-NS  1.02x
>         IPERF-R        VM-V6-VM  1.00x
>         IPERF-R   VM-V6-ExtHost  1.11x
>         IPERF-R        VM-V6-NS  1.00x
> 
> Now using trex, 64byte packet, PVP:
> Original: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.92 (8331383020/27323150)
>   avg processing cycles per packet: 304.92 (8331383020/27323150)
> 
> Patched: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.08 (21875784116/71941516)
>   avg processing cycles per packet: 304.08 (21875784116/71941516)


Ran the following tests in VSPERF for QoS and they all pass:
ovsdpdk_qos_create_phy_port	
ovsdpdk_qos_delete_phy_port	
ovsdpdk_qos_create_vport	
ovsdpdk_qos_delete_vport	
ovsdpdk_qos_create_no_cir	
ovsdpdk_qos_create_no_cbs

For performance tests:
We see a slight dip in performance (~1%) for physical ports in the phy2phy_tput test in VSPERF with the patch (v1) for 64-byte packets,
and the PVP test results are in line with your observations.
NIC: Fortville 10G X710


Performance with different traffic profiles when deployed with 32 VMs + 1M flows + VXLAN enabled:
With burst mode: ~1% increase in performance,
as compared to
scatter mode: ~4% decrease in performance.

[snipped]

Thanks , 
Sunil
Intel

> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Flavio Leitner Jan. 11, 2021, 7:36 p.m. UTC | #2
Hi Sunil,

On Mon, Jan 11, 2021 at 03:49:56PM +0000, Pai G, Sunil wrote:
> Hi Flavio,
> 
> 
> > -----Original Message-----
> > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> > Sent: Sunday, January 10, 2021 8:35 AM
> > To: dev@openvswitch.org
> > Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner
> > <fbl@sysclose.org>
> > Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit
> > path.
> > 
> > This patch split out the common code between vhost and dpdk transmit
> > paths to shared functions to simplify the code and fix an issue.
> > 
> > The issue is that the packet coming from non-DPDK device and egressing on a
> > DPDK device currently skips the hwol preparation.
> > 
> > This also have the side effect of leaving only the dpdk transmit code under
> > the txq lock.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> > 
> > V2:
> >    - mentioned the tx lock change in the commit message.
> >    - fixed packet leak when copy fails.
> >    - moved pkt_cnt = cnt two lines up.
> > 
> > I tested the following scenarios with iperf and iperf -R and noticed no
> > performance regression:
> >           IPERF       VM-Bridge  1.02x
> >           IPERF           VM-VM  1.04x
> >           IPERF      VM-ExtHost  1.00x
> >           IPERF           VM-NS  1.01x
> >           IPERF  VM-VLAN-Bridge  1.03x
> >           IPERF      VM-VLAN-VM  1.03x
> >           IPERF VM-VLAN-ExtHost  1.01x
> >           IPERF      VM-VLAN-NS  1.02x
> >           IPERF        VM-V6-VM  1.03x
> >           IPERF   VM-V6-ExtHost  1.00x
> >           IPERF        VM-V6-NS  1.01x
> >         IPERF-R       VM-Bridge  1.01x
> >         IPERF-R           VM-VM  1.04x
> >         IPERF-R      VM-ExtHost  1.10x
> >         IPERF-R           VM-NS  1.01x
> >         IPERF-R  VM-VLAN-Bridge  1.03x
> >         IPERF-R      VM-VLAN-VM  1.02x
> >         IPERF-R VM-VLAN-ExtHost  1.08x
> >         IPERF-R      VM-VLAN-NS  1.02x
> >         IPERF-R        VM-V6-VM  1.00x
> >         IPERF-R   VM-V6-ExtHost  1.11x
> >         IPERF-R        VM-V6-NS  1.00x
> > 
> > Now using trex, 64byte packet, PVP:
> > Original: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.92 (8331383020/27323150)
> >   avg processing cycles per packet: 304.92 (8331383020/27323150)
> > 
> > Patched: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.08 (21875784116/71941516)
> >   avg processing cycles per packet: 304.08 (21875784116/71941516)
> 
> 
> Ran the following tests in VSPERF for QoS and they all pass:

Great, thanks for testing the patch.

> ovsdpdk_qos_create_phy_port	
> ovsdpdk_qos_delete_phy_port	
> ovsdpdk_qos_create_vport	
> ovsdpdk_qos_delete_vport	
> ovsdpdk_qos_create_no_cir	
> ovsdpdk_qos_create_no_cbs
> 
> For performance tests:
> We see a slight dip in performance for physical ports ~1% in phy2phy_tput test in VSPERF with the patch(v1) for 64byte packet

I tried P2P here (v2) with trex, single flow, single queue,
single PMD and I got this:

original: 12.88Mpps/6.6Gbps
  idle cycles: 0 (0.00%)
  processing cycles: 14466174204 (100.00%)
  avg cycles per packet: 170.73 (14466174204/84730796)
  avg processing cycles per packet: 170.73 (14466174204/84730796)


patched: 13.00Mpps/6.65Gbps
  idle cycles: 0 (0.00%)
  processing cycles: 3783269716 (100.00%)
  avg cycles per packet: 169.49 (3783269716/22321968)
  avg processing cycles per packet: 169.49 (3783269716/22321968)
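
As a quick sanity check on the Mpps/Gbps pairs above (a hypothetical
helper, assuming the Gbps figure counts only the 64-byte L2 frame and
not the preamble/IFG overhead):

#include <stdio.h>

/* Hypothetical helper: throughput in Gbps for 64-byte frames,
 * counting only the frame bits (no preamble/IFG overhead). */
static double mpps_to_gbps_64b(double mpps) { return mpps * 64 * 8 / 1000.0; }

int main(void)
{
    printf("%.2f Gbps\n", mpps_to_gbps_64b(12.88)); /* ~6.59, close to the 6.6 above  */
    printf("%.2f Gbps\n", mpps_to_gbps_64b(13.00)); /* ~6.66, close to the 6.65 above */
    return 0;
}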

So, I see a small performance gain. Can you provide a more detailed
test description? I wonder if we are testing the same thing.

> and PVP test results are in line with your observations.
> NIC: Fortville 10G X710

Great.

> Performance with different traffic profiles when deployed with 32VM's +1M flows + vxlan enabled :
> With burst mode : ~1% increase in performance

That is aligned with above.

> as compared to 
> scatter mode: ~4% decrease in performance.

Hm, that is unexpected. Is this reproducible?
How are you switching between burst and scatter mode?

Thanks,
fbl
David Marchand Jan. 12, 2021, 7:57 a.m. UTC | #3
On Sun, Jan 10, 2021 at 4:05 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
> This patch split out the common code between vhost and
> dpdk transmit paths to shared functions to simplify the
> code and fix an issue.
>
> The issue is that the packet coming from non-DPDK device
> and egressing on a DPDK device currently skips the hwol
> preparation.
>
> This also have the side effect of leaving only the dpdk

nit: has*

> transmit code under the txq lock.
>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Reviewed-by: David Marchand <david.marchand@redhat.com>

Thanks.
Kovacevic, Marko Jan. 13, 2021, 10:51 a.m. UTC | #4
Hi Flavio,

> So, I see a small performance gain. Can you provide a more detailed
> test description? I wonder if we are testing the same thing.

VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic.

> 
> > and PVP test results are in line with your observations.
> > NIC: Fortville 10G X710
> 
> Great.
> 
> > Performance with different traffic profiles when deployed with 32VM's
> +1M flows + vxlan enabled :
> > With burst mode : ~1% increase in performance
> 
> That is aligned with above.
> 
> > as compared to
> > scatter mode: ~4% decrease in performance.
> 
> Hm, that is unexpected. Is this reproducible?

If you mean whether the results are reproducible, then yes,
I did a re-run of v1 and v2.

I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results
DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page.

V1:
DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path.
7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page.

V1: Burst = +1%
V1: Scatter = -4%
As Sunil already reported, this shows the same results again.

As for V2, it didn't show any increase or decrease for either burst or scatter,
and even between the V1 and V2 results from my second run it didn't show much difference in my test anyway.

> How are you switching between burst and scatter mode?

As for how we're switching between burst and scatter, I just use two different traffic profiles and gather results.
So I run the test using the burst profile, restart the test, then use the scatter profile.

Traffic @ Phy NIC Rx:
Ether()/IP()/UDP()/VXLAN()/Ether()/IP()

Burst: on the outer IP we send a burst of 32 packets with the same IP, then switch IP for the next 32, and so on.
Scatter: we increment the outer IP for each packet within the 32.

And we do this for 1M flows
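
A rough sketch of one plausible reading of the two profiles (a
hypothetical helper, not the actual traffic generator code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration of the two outer-IP patterns described
 * above, over windows of 32 packets. */
static uint32_t
outer_src_ip(uint32_t base, uint32_t i, bool scatter)
{
    /* Burst: 32 consecutive packets share one outer IP, then move on.
     * Scatter: the outer IP changes on every packet within the 32. */
    return scatter ? base + (i % 32) : base + (i / 32);
}

int main(void)
{
    for (uint32_t i = 0; i < 4; i++) {
        printf("pkt %u: burst ip+%u, scatter ip+%u\n",
               i, outer_src_ip(0, i, false), outer_src_ip(0, i, true));
    }
    return 0;
}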

I hope this answers your question
> 
> Thanks,
> fbl

Thanks,
Marko K
Flavio Leitner Jan. 13, 2021, 6:01 p.m. UTC | #5
Hi Marko,

On Wed, Jan 13, 2021 at 10:51:03AM +0000, Kovacevic, Marko wrote:
> Hi Flavio,
> 
> > So, I see a small performance gain. Can you provide a more detailed
> > test description? I wonder if we are testing the same thing.
> 
> VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic.

Ok.

> > > and PVP test results are in line with your observations.
> > > NIC: Fortville 10G X710
> > 
> > Great.
> > 
> > > Performance with different traffic profiles when deployed with 32VM's
> > +1M flows + vxlan enabled :
> > > With burst mode : ~1% increase in performance
> > 
> > That is aligned with above.
> > 
> > > as compared to
> > > scatter mode: ~4% decrease in performance.
> > 
> > Hm, that is unexpected. Is this reproducible?
> 
> If you mean are the results reproducible then yes, 
> I done a re-run of v1 and v2 again 

Thanks for checking.

> I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results
> DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
> OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page.
> 
> V1:
> DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
> OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path.
> 7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page.
> 
> V1: Burst= 1%
> V1: Scatter -4%
> As sunil reported already shows the same results again

Alright. I will try to reproduce in my lab to understand the root
cause.


> As for V2 it didn't show any increase or decrease for both burst and scatter, 
> And even between the v1 and v2 results from my second run it didn't show much difference on my test anyway.
> 
> > How are you switching between burst and scatter mode?
> 
> As for how were switching between burst and scatter, I just use two different traffic profiles & gather results.
> So I run the test using burst profile restart the test then use scatter.
> 
> Traffic @ Phy NIC Rx:
> Ether()/IP()/UDP()/VXLAN()/Ether()/IP()
> 
> Burst: on the outer ip we do a burst of 32 packets with same ip then switch for next 32 and so on 
> Scatter: for scatter we do incrementally for 32 

Glad that I asked, because I thought you were talking about the
NIC's scatter mode being enabled or not.

> And we do this for 1M flows
> 
> I hope this answers your question

They are helpful. I will try to reproduce the -4% results in my
lab as a next step. It may take some time.

Thanks for testing and providing feedback!
Mike Pattrick Jan. 5, 2022, 8:01 p.m. UTC | #6
Hello Flavio,

Great patch, I think you really did a lot to improve the code here and
I think that's borne out by the consistent performance improvements
across multiple tests.

Regarding the 4% regression that Intel detected, I found the following
white paper to describe the "scatter" test:

https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf

This document calls out the following key points:

The original test was summarized as:
- 32 VMs with one million flows.
- Test runs on four physical cores for OVS and 10 hyper-threaded cores
for TestPMD
- An Ixia pitches traffic at a sub 0.1% loss rate
- The server catches traffic with an E810-C 100G
- The traffic's profile is: Ether()/IP()/UDP()/VXLAN()/Ether()/IP() 
- On the outer IP, the source address changes incrementally across the
32 instances
- The destination address remains the same on the outer IP.
- The inner source IP remains the same
- The inner destination address increments to create the one million
flows for the test
- EMC and SMC were disabled
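
A back-of-the-envelope check of that flow count (assuming the flows are
split evenly across the 32 instances, which the paper does not state
explicitly):

#include <stdio.h>

/* Hypothetical check: inner destinations needed per instance to reach
 * one million flows, assuming an even split across 32 instances. */
int main(void)
{
    const unsigned instances = 32;
    const unsigned total_flows = 1000000;

    printf("%u inner destinations per instance\n", total_flows / instances);
    return 0;
}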

I could not reproduce this test exactly because I don't have access to
the same hardware - notably the Intel NIC and an Ixia - and I didn't
want to create an environment that wouldn't be reproduced in real world
scenarios. I did pin VM and TXQs/RXQs threads to cores, but I didn't
optimize the setup nearly to the extant that the white paper described.
My test setup consisted of two Fedora 35 servers directly connected
across Mellanox5E cards with Trex pitching traffic and TestPMD
reflecting it.

In my test I was still able to reproduce a similar performance penalty.
I found that the key factors was the combination of VXLAN and a large
number of flows. So once I had a setup that could reproduce close to
the 4% penalty I stopped modifying my test framework and started
searching for the slow code.

I didn't see any obvious issues in the code that should cause a
significant slowdown; in fact, most of the code is identical or
slightly improved. So to help my analysis, I created several variations
of your patch reverting small aspects of the change and benchmarked
each variation.

Because the difference in performance across each variation was so
minor, I took a lot of samples. I pitched traffic over one million
flows for 240 seconds and averaged out the throughput; I then repeated
this process a total of five times for each patch. Finally, I repeated
the whole process three times to produce 15 data points per patch.

The best results came from the patch enclosed below, with the code from
netdev_dpdk_common_send() protected by the spinlock, as it is in the
pre-patch code. This yielded a 2.7% +/- 0.64 performance boost over the
master branch.


Cheers,
Michael


diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index bc1633663..5db5d7e2a 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2777,13 +2777,13 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
         return 0;
     }
 
-    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
-
     if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
         COVERAGE_INC(vhost_tx_contention);
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+
     pkts = (struct rte_mbuf **) batch->packets;
     vhost_batch_cnt = cnt;
     retries = 0;
@@ -2843,13 +2843,15 @@ netdev_dpdk_eth_send(struct netdev *netdev, int qid,
         return 0;
     }
 
-    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
-    dropped = batch_cnt - cnt;
     if (OVS_UNLIKELY(concurrent_txq)) {
         qid = qid % dev->up.n_txq;
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+
+    dropped = batch_cnt - cnt;
+
     dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
     if (OVS_UNLIKELY(dropped)) {
         struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;


On Sun, 2021-01-10 at 00:05 -0300, Flavio Leitner wrote:
> This patch split out the common code between vhost and
> dpdk transmit paths to shared functions to simplify the
> code and fix an issue.
> 
> The issue is that the packet coming from non-DPDK device
> and egressing on a DPDK device currently skips the hwol
> preparation.
> 
> This also have the side effect of leaving only the dpdk
> transmit code under the txq lock.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> Reviewed-by: David Marchand <david.marchand@redhat.com>
> ---
> 
> V2:
>    - mentioned the tx lock change in the commit message.
>    - fixed packet leak when copy fails.
>    - moved pkt_cnt = cnt two lines up.
> 
> I tested the following scenarios with iperf and iperf -R
> and noticed no performance regression:
>           IPERF       VM-Bridge  1.02x
>           IPERF           VM-VM  1.04x
>           IPERF      VM-ExtHost  1.00x
>           IPERF           VM-NS  1.01x
>           IPERF  VM-VLAN-Bridge  1.03x
>           IPERF      VM-VLAN-VM  1.03x
>           IPERF VM-VLAN-ExtHost  1.01x
>           IPERF      VM-VLAN-NS  1.02x
>           IPERF        VM-V6-VM  1.03x
>           IPERF   VM-V6-ExtHost  1.00x
>           IPERF        VM-V6-NS  1.01x
>         IPERF-R       VM-Bridge  1.01x
>         IPERF-R           VM-VM  1.04x
>         IPERF-R      VM-ExtHost  1.10x
>         IPERF-R           VM-NS  1.01x
>         IPERF-R  VM-VLAN-Bridge  1.03x
>         IPERF-R      VM-VLAN-VM  1.02x
>         IPERF-R VM-VLAN-ExtHost  1.08x
>         IPERF-R      VM-VLAN-NS  1.02x
>         IPERF-R        VM-V6-VM  1.00x
>         IPERF-R   VM-V6-ExtHost  1.11x
>         IPERF-R        VM-V6-NS  1.00x
> 
> Now using trex, 64byte packet, PVP:
> Original: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.92 (8331383020/27323150)
>   avg processing cycles per packet: 304.92 (8331383020/27323150)
> 
> Patched: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.08 (21875784116/71941516)
>   avg processing cycles per packet: 304.08 (21875784116/71941516)
> 
>  lib/netdev-dpdk.c | 335 +++++++++++++++++++-------------------------
> --
>  1 file changed, 139 insertions(+), 196 deletions(-)
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2640a421a..a1437db4d 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -2585,90 +2585,6 @@ netdev_dpdk_vhost_update_tx_counters(struct
> netdev_dpdk *dev,
>      }
>  }
>  
> -static void
> -__netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> -                         struct dp_packet **pkts, int cnt)
> -{
> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> -    struct netdev_dpdk_sw_stats sw_stats_add;
> -    unsigned int n_packets_to_free = cnt;
> -    unsigned int total_packets = cnt;
> -    int i, retries = 0;
> -    int max_retries = VHOST_ENQ_RETRY_MIN;
> -    int vid = netdev_dpdk_get_vid(dev);
> -
> -    qid = dev->tx_q[qid % netdev->n_txq].map;
> -
> -    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
> -                     || !(dev->flags & NETDEV_UP))) {
> -        rte_spinlock_lock(&dev->stats_lock);
> -        dev->stats.tx_dropped+= cnt;
> -        rte_spinlock_unlock(&dev->stats_lock);
> -        goto out;
> -    }
> -
> -    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev-
> >tx_q[qid].tx_lock))) {
> -        COVERAGE_INC(vhost_tx_contention);
> -        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> -    }
> -
> -    sw_stats_add.tx_invalid_hwol_drops = cnt;
> -    if (userspace_tso_enabled()) {
> -        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> -    }
> -
> -    sw_stats_add.tx_invalid_hwol_drops -= cnt;
> -    sw_stats_add.tx_mtu_exceeded_drops = cnt;
> -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> -    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
> -
> -    /* Check has QoS has been configured for the netdev */
> -    sw_stats_add.tx_qos_drops = cnt;
> -    cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
> -    sw_stats_add.tx_qos_drops -= cnt;
> -
> -    n_packets_to_free = cnt;
> -
> -    do {
> -        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
> -        unsigned int tx_pkts;
> -
> -        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts,
> cnt);
> -        if (OVS_LIKELY(tx_pkts)) {
> -            /* Packets have been sent.*/
> -            cnt -= tx_pkts;
> -            /* Prepare for possible retry.*/
> -            cur_pkts = &cur_pkts[tx_pkts];
> -            if (OVS_UNLIKELY(cnt && !retries)) {
> -                /*
> -                 * Read max retries as there are packets not sent
> -                 * and no retries have already occurred.
> -                 */
> -                atomic_read_relaxed(&dev->vhost_tx_retries_max,
> &max_retries);
> -            }
> -        } else {
> -            /* No packets sent - do not retry.*/
> -            break;
> -        }
> -    } while (cnt && (retries++ < max_retries));
> -
> -    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
> -
> -    sw_stats_add.tx_failure_drops = cnt;
> -    sw_stats_add.tx_retries = MIN(retries, max_retries);
> -
> -    rte_spinlock_lock(&dev->stats_lock);
> -    netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets,
> -                                         &sw_stats_add);
> -    rte_spinlock_unlock(&dev->stats_lock);
> -
> -out:
> -    for (i = 0; i < n_packets_to_free; i++) {
> -        dp_packet_delete(pkts[i]);
> -    }
> -}
> -
>  static void
>  netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
>  {
> @@ -2783,76 +2699,70 @@ dpdk_copy_dp_packet_to_mbuf(struct
> rte_mempool *mp, struct dp_packet *pkt_orig)
>      return pkt_dest;
>  }
>  
> -/* Tx function. Transmit packets indefinitely */
> -static void
> -dpdk_do_tx_copy(struct netdev *netdev, int qid, struct
> dp_packet_batch *batch)
> +/* Replace packets in a 'batch' with their corresponding copies
> using
> + * DPDK memory.
> + *
> + * Returns the number of good packets in the batch. */
> +static size_t
> +dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct
> dp_packet_batch *batch)
>      OVS_NO_THREAD_SAFETY_ANALYSIS
>  {
> -    const size_t batch_cnt = dp_packet_batch_size(batch);
> -#if !defined(__CHECKER__) && !defined(_WIN32)
> -    const size_t PKT_ARRAY_SIZE = batch_cnt;
> -#else
> -    /* Sparse or MSVC doesn't like variable length array. */
> -    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> -#endif
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> -    struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> -    uint32_t cnt = batch_cnt;
> -    uint32_t dropped = 0;
> -    uint32_t tx_failure = 0;
> -    uint32_t mtu_drops = 0;
> -    uint32_t qos_drops = 0;
> -
> -    if (dev->type != DPDK_DEV_VHOST) {
> -        /* Check if QoS has been configured for this netdev. */
> -        cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch-
> >packets,
> -                                  batch_cnt, false);
> -        qos_drops = batch_cnt - cnt;
> -    }
> -
> -    uint32_t txcnt = 0;
> -
> -    for (uint32_t i = 0; i < cnt; i++) {
> -        struct dp_packet *packet = batch->packets[i];
> -        uint32_t size = dp_packet_size(packet);
> -
> -        if (size > dev->max_packet_len
> -            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> size,
> -                         dev->max_packet_len);
> -            mtu_drops++;
> -            continue;
> -        }
> +    size_t i, size = dp_packet_batch_size(batch);
> +    struct dp_packet *packet;
>  
> -        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp,
> packet);
> -        if (OVS_UNLIKELY(!pkts[txcnt])) {
> -            dropped = cnt - i;
> -            break;
> -        }
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> +        if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) {
> +            dp_packet_batch_refill(batch, packet, i);
> +        } else {
> +            struct dp_packet *pktcopy;
>  
> -        txcnt++;
> -    }
> +            pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp,
> packet);
> +            if (pktcopy) {
> +                dp_packet_batch_refill(batch, pktcopy, i);
> +            }
>  
> -    if (OVS_LIKELY(txcnt)) {
> -        if (dev->type == DPDK_DEV_VHOST) {
> -            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> -        } else {
> -            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> -                                                   (struct rte_mbuf
> **)pkts,
> -                                                   txcnt);
> +            dp_packet_delete(packet);
>          }
>      }
>  
> -    dropped += qos_drops + mtu_drops + tx_failure;
> -    if (OVS_UNLIKELY(dropped)) {
> -        rte_spinlock_lock(&dev->stats_lock);
> -        dev->stats.tx_dropped += dropped;
> -        sw_stats->tx_failure_drops += tx_failure;
> -        sw_stats->tx_mtu_exceeded_drops += mtu_drops;
> -        sw_stats->tx_qos_drops += qos_drops;
> -        rte_spinlock_unlock(&dev->stats_lock);
> +    return dp_packet_batch_size(batch);
> +}
> +
> +static size_t
> +netdev_dpdk_common_send(struct netdev *netdev, struct
> dp_packet_batch *batch,
> +                        struct netdev_dpdk_sw_stats *stats)
> +{
> +    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    size_t cnt, pkt_cnt = dp_packet_batch_size(batch);
> +
> +    memset(stats, 0, sizeof *stats);
> +
> +    /* Copy dp-packets to mbufs. */
> +    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> +        cnt = dpdk_copy_batch_to_mbuf(netdev, batch);
> +        stats->tx_failure_drops += pkt_cnt - cnt;
> +        pkt_cnt = cnt;
>      }
> +
> +    /* Drop oversized packets. */
> +    cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt);
> +    stats->tx_mtu_exceeded_drops += pkt_cnt - cnt;
> +    pkt_cnt = cnt;
> +
> +    /* Prepare each mbuf for hardware offloading. */
> +    if (userspace_tso_enabled()) {
> +        cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt);
> +        stats->tx_invalid_hwol_drops += pkt_cnt - cnt;
> +        pkt_cnt = cnt;
> +    }
> +
> +    /* Apply Quality of Service policy. */
> +    cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true);
> +    stats->tx_qos_drops += pkt_cnt - cnt;
> +
> +    return cnt;
>  }
>  
>  static int
> @@ -2860,82 +2770,115 @@ netdev_dpdk_vhost_send(struct netdev
> *netdev, int qid,
>                         struct dp_packet_batch *batch,
>                         bool concurrent_txq OVS_UNUSED)
>  {
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int max_retries = VHOST_ENQ_RETRY_MIN;
> +    int cnt, batch_cnt, vhost_batch_cnt;
> +    int vid = netdev_dpdk_get_vid(dev);
> +    struct netdev_dpdk_sw_stats stats;
> +    struct rte_mbuf **pkts;
> +    int retries;
>  
> -    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> -        dpdk_do_tx_copy(netdev, qid, batch);
> +    batch_cnt = cnt = dp_packet_batch_size(batch);
> +    qid = dev->tx_q[qid % netdev->n_txq].map;
> +    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
> +                     || !(dev->flags & NETDEV_UP))) {
> +        rte_spinlock_lock(&dev->stats_lock);
> +        dev->stats.tx_dropped += cnt;
> +        rte_spinlock_unlock(&dev->stats_lock);
>          dp_packet_delete_batch(batch, true);
> -    } else {
> -        __netdev_dpdk_vhost_send(netdev, qid, batch->packets,
> -                                 dp_packet_batch_size(batch));
> +        return 0;
> +    }
> +
> +    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> +
> +    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev-
> >tx_q[qid].tx_lock))) {
> +        COVERAGE_INC(vhost_tx_contention);
> +        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>      }
> +
> +    pkts = (struct rte_mbuf **) batch->packets;
> +    vhost_batch_cnt = cnt;
> +    retries = 0;
> +    do {
> +        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
> +        int tx_pkts;
> +
> +        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts,
> cnt);
> +        if (OVS_LIKELY(tx_pkts)) {
> +            /* Packets have been sent.*/
> +            cnt -= tx_pkts;
> +            /* Prepare for possible retry.*/
> +            pkts = &pkts[tx_pkts];
> +            if (OVS_UNLIKELY(cnt && !retries)) {
> +                /*
> +                 * Read max retries as there are packets not sent
> +                 * and no retries have already occurred.
> +                 */
> +                atomic_read_relaxed(&dev->vhost_tx_retries_max,
> &max_retries);
> +            }
> +        } else {
> +            /* No packets sent - do not retry.*/
> +            break;
> +        }
> +    } while (cnt && (retries++ < max_retries));
> +
> +    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
> +
> +    stats.tx_failure_drops += cnt;
> +    stats.tx_retries = MIN(retries, max_retries);
> +
> +    rte_spinlock_lock(&dev->stats_lock);
> +    netdev_dpdk_vhost_update_tx_counters(dev, batch->packets,
> batch_cnt,
> +                                         &stats);
> +    rte_spinlock_unlock(&dev->stats_lock);
> +
> +    pkts = (struct rte_mbuf **) batch->packets;
> +    for (int i = 0; i < vhost_batch_cnt; i++) {
> +        rte_pktmbuf_free(pkts[i]);
> +    }
> +
>      return 0;
>  }
>  
> -static inline void
> -netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> -                   struct dp_packet_batch *batch,
> -                   bool concurrent_txq)
> +static int
> +netdev_dpdk_eth_send(struct netdev *netdev, int qid,
> +                     struct dp_packet_batch *batch, bool
> concurrent_txq)
>  {
> +    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int batch_cnt = dp_packet_batch_size(batch);
> +    struct netdev_dpdk_sw_stats stats;
> +    int cnt, dropped;
> +
>      if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
>          dp_packet_delete_batch(batch, true);
> -        return;
> +        return 0;
>      }
>  
> +    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> +    dropped = batch_cnt - cnt;
>      if (OVS_UNLIKELY(concurrent_txq)) {
>          qid = qid % dev->up.n_txq;
>          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>      }
>  
> -    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> -        struct netdev *netdev = &dev->up;
> -
> -        dpdk_do_tx_copy(netdev, qid, batch);
> -        dp_packet_delete_batch(batch, true);
> -    } else {
> +    dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
> +    if (OVS_UNLIKELY(dropped)) {
>          struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> -        int dropped;
> -        int tx_failure, mtu_drops, qos_drops, hwol_drops;
> -        int batch_cnt = dp_packet_batch_size(batch);
> -        struct rte_mbuf **pkts = (struct rte_mbuf **) batch-
> >packets;
>  
> -        hwol_drops = batch_cnt;
> -        if (userspace_tso_enabled()) {
> -            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts,
> batch_cnt);
> -        }
> -        hwol_drops -= batch_cnt;
> -        mtu_drops = batch_cnt;
> -        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts,
> batch_cnt);
> -        mtu_drops -= batch_cnt;
> -        qos_drops = batch_cnt;
> -        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
> -        qos_drops -= batch_cnt;
> -
> -        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts,
> batch_cnt);
> -
> -        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
> -        if (OVS_UNLIKELY(dropped)) {
> -            rte_spinlock_lock(&dev->stats_lock);
> -            dev->stats.tx_dropped += dropped;
> -            sw_stats->tx_failure_drops += tx_failure;
> -            sw_stats->tx_mtu_exceeded_drops += mtu_drops;
> -            sw_stats->tx_qos_drops += qos_drops;
> -            sw_stats->tx_invalid_hwol_drops += hwol_drops;
> -            rte_spinlock_unlock(&dev->stats_lock);
> -        }
> +        rte_spinlock_lock(&dev->stats_lock);
> +        dev->stats.tx_dropped += dropped;
> +        sw_stats->tx_failure_drops += stats.tx_failure_drops;
> +        sw_stats->tx_mtu_exceeded_drops +=
> stats.tx_mtu_exceeded_drops;
> +        sw_stats->tx_qos_drops += stats.tx_qos_drops;
> +        sw_stats->tx_invalid_hwol_drops +=
> stats.tx_invalid_hwol_drops;
> +        rte_spinlock_unlock(&dev->stats_lock);
>      }
>  
>      if (OVS_UNLIKELY(concurrent_txq)) {
>          rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>      }
> -}
> -
> -static int
> -netdev_dpdk_eth_send(struct netdev *netdev, int qid,
> -                     struct dp_packet_batch *batch, bool
> concurrent_txq)
> -{
> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>  
> -    netdev_dpdk_send__(dev, qid, batch, concurrent_txq);
>      return 0;
>  }
>
Flavio Leitner Jan. 12, 2022, 2:52 p.m. UTC | #7
Hello Sunil, Marko and Ian.

Mike worked to identify the reason for the performance issue
you reported a while ago; he summarized it below. I wonder
if you could give his patch a try too and tell us if we are
on the right track.

Thanks,
fbl

On Wed, Jan 05, 2022 at 03:01:47PM -0500, Mike Pattrick wrote:
> Hello Flavio,
> 
> Great patch, I think you really did a lot to improve the code here and
> I think that's borne out by the consistent performance improvements
> across multiple tests.
> 
> Regarding the 4% regression that Intel detected, I found the following
> white paper to describe the "scatter" test:
> 
> https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
> 
> This document calls out the following key points:
> 
> The original test was summarized as:
> - 32 VMs with one million flows.
> - Test runs on four physical cores for OVS and 10 hyper-threaded cores
> for TestPMD
> - An Ixia pitches traffic at a sub 0.1% loss rate
> - The server catches traffic with a E810-C 100G
> - The traffic's profile is: Ether()/IP()/UDP()/VXLAN()/Ether()/IP() 
> - On the outer IP, the source address changes incrementally across the
> 32 instances
> - The destination address remains the same on the outer IP.
> - The inner source IP remains
> - The inner destination address increments to create the one million
> flows for the test
> - EMC and SMC were disabled
> 
> I could not reproduce this test exactly because I don't have access to
> the same hardware - notably the Intel NIC and an Ixia - and I didn't
> want to create an environment that wouldn't be reproduced in real world
> scenarios. I did pin VM and TXQs/RXQs threads to cores, but I didn't
> optimize the setup nearly to the extant that the white paper described.
> My test setup consisted of two Fedora 35 servers directly connected
> across Mellanox5E cards with Trex pitching traffic and TestPMD
> reflecting it.
> 
> In my test I was still able to reproduce a similar performance penalty.
> I found that the key factors was the combination of VXLAN and a large
> number of flows. So once I had a setup that could reproduce close to
> the 4% penalty I stopped modifying my test framework and started
> searching for the slow code.
> 
> I didn't see any obvious issues in the code that should cause a
> significant slowdown, in fact, most of the code is identical or
> slightly improved. So to help my analysis, I created several variations
> of your patch reverting small aspects of the change and benchmarked
> each variation.
> 
> Because the difference in performance across each variation was so
> minor, I took a lot of samples. I pitched traffic over one million
> flows for 240 seconds and averaged out the throughput, I then repeated
> this process a total of five times for each patch. Finally, I repeated
> the whole process three times to produce 15 data points per patch.
> 
> The best results came from the patch enclosed below, with the code from
> netdev_dpdk_common_send() protected by the splinlock, as it is in the
> pre-patch code. This yielded a 2.7% +/- 0.64 performance boost over the
> master branch.
> 
> 
> Cheers,
> Michael
> 
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index bc1633663..5db5d7e2a 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -2777,13 +2777,13 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>          return 0;
>      }
>  
> -    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> -
>      if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
>          COVERAGE_INC(vhost_tx_contention);
>          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>      }
>  
> +    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> +
>      pkts = (struct rte_mbuf **) batch->packets;
>      vhost_batch_cnt = cnt;
>      retries = 0;
> @@ -2843,13 +2843,15 @@ netdev_dpdk_eth_send(struct netdev *netdev, int qid,
>          return 0;
>      }
>  
> -    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> -    dropped = batch_cnt - cnt;
>      if (OVS_UNLIKELY(concurrent_txq)) {
>          qid = qid % dev->up.n_txq;
>          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>      }
>  
> +    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
> +
> +    dropped = batch_cnt - cnt;
> +
>      dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
>      if (OVS_UNLIKELY(dropped)) {
>          struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> 
> 
> On Sun, 2021-01-10 at 00:05 -0300, Flavio Leitner wrote:
> > This patch split out the common code between vhost and
> > dpdk transmit paths to shared functions to simplify the
> > code and fix an issue.
> > 
> > The issue is that the packet coming from non-DPDK device
> > and egressing on a DPDK device currently skips the hwol
> > preparation.
> > 
> > This also have the side effect of leaving only the dpdk
> > transmit code under the txq lock.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > Reviewed-by: David Marchand <david.marchand@redhat.com>
> > ---
> > 
> > V2:
> >    - mentioned the tx lock change in the commit message.
> >    - fixed packet leak when copy fails.
> >    - moved pkt_cnt = cnt two lines up.
> > 
> > I tested the following scenarios with iperf and iperf -R
> > and noticed no performance regression:
> >           IPERF       VM-Bridge  1.02x
> >           IPERF           VM-VM  1.04x
> >           IPERF      VM-ExtHost  1.00x
> >           IPERF           VM-NS  1.01x
> >           IPERF  VM-VLAN-Bridge  1.03x
> >           IPERF      VM-VLAN-VM  1.03x
> >           IPERF VM-VLAN-ExtHost  1.01x
> >           IPERF      VM-VLAN-NS  1.02x
> >           IPERF        VM-V6-VM  1.03x
> >           IPERF   VM-V6-ExtHost  1.00x
> >           IPERF        VM-V6-NS  1.01x
> >         IPERF-R       VM-Bridge  1.01x
> >         IPERF-R           VM-VM  1.04x
> >         IPERF-R      VM-ExtHost  1.10x
> >         IPERF-R           VM-NS  1.01x
> >         IPERF-R  VM-VLAN-Bridge  1.03x
> >         IPERF-R      VM-VLAN-VM  1.02x
> >         IPERF-R VM-VLAN-ExtHost  1.08x
> >         IPERF-R      VM-VLAN-NS  1.02x
> >         IPERF-R        VM-V6-VM  1.00x
> >         IPERF-R   VM-V6-ExtHost  1.11x
> >         IPERF-R        VM-V6-NS  1.00x
> > 
> > Now using trex, 64byte packet, PVP:
> > Original: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.92 (8331383020/27323150)
> >   avg processing cycles per packet: 304.92 (8331383020/27323150)
> > 
> > Patched: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.08 (21875784116/71941516)
> >   avg processing cycles per packet: 304.08 (21875784116/71941516)
> > 
> >  lib/netdev-dpdk.c | 335 +++++++++++++++++++-------------------------
> > --
> >  1 file changed, 139 insertions(+), 196 deletions(-)
> > 
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 2640a421a..a1437db4d 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -2585,90 +2585,6 @@ netdev_dpdk_vhost_update_tx_counters(struct
> > netdev_dpdk *dev,
> >      }
> >  }
> >  
> > -static void
> > -__netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> > -                         struct dp_packet **pkts, int cnt)
> > -{
> > -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> > -    struct netdev_dpdk_sw_stats sw_stats_add;
> > -    unsigned int n_packets_to_free = cnt;
> > -    unsigned int total_packets = cnt;
> > -    int i, retries = 0;
> > -    int max_retries = VHOST_ENQ_RETRY_MIN;
> > -    int vid = netdev_dpdk_get_vid(dev);
> > -
> > -    qid = dev->tx_q[qid % netdev->n_txq].map;
> > -
> > -    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
> > -                     || !(dev->flags & NETDEV_UP))) {
> > -        rte_spinlock_lock(&dev->stats_lock);
> > -        dev->stats.tx_dropped+= cnt;
> > -        rte_spinlock_unlock(&dev->stats_lock);
> > -        goto out;
> > -    }
> > -
> > -    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev-
> > >tx_q[qid].tx_lock))) {
> > -        COVERAGE_INC(vhost_tx_contention);
> > -        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> > -    }
> > -
> > -    sw_stats_add.tx_invalid_hwol_drops = cnt;
> > -    if (userspace_tso_enabled()) {
> > -        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> > -    }
> > -
> > -    sw_stats_add.tx_invalid_hwol_drops -= cnt;
> > -    sw_stats_add.tx_mtu_exceeded_drops = cnt;
> > -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> > -    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
> > -
> > -    /* Check has QoS has been configured for the netdev */
> > -    sw_stats_add.tx_qos_drops = cnt;
> > -    cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
> > -    sw_stats_add.tx_qos_drops -= cnt;
> > -
> > -    n_packets_to_free = cnt;
> > -
> > -    do {
> > -        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
> > -        unsigned int tx_pkts;
> > -
> > -        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts,
> > cnt);
> > -        if (OVS_LIKELY(tx_pkts)) {
> > -            /* Packets have been sent.*/
> > -            cnt -= tx_pkts;
> > -            /* Prepare for possible retry.*/
> > -            cur_pkts = &cur_pkts[tx_pkts];
> > -            if (OVS_UNLIKELY(cnt && !retries)) {
> > -                /*
> > -                 * Read max retries as there are packets not sent
> > -                 * and no retries have already occurred.
> > -                 */
> > -                atomic_read_relaxed(&dev->vhost_tx_retries_max,
> > &max_retries);
> > -            }
> > -        } else {
> > -            /* No packets sent - do not retry.*/
> > -            break;
> > -        }
> > -    } while (cnt && (retries++ < max_retries));
> > -
> > -    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
> > -
> > -    sw_stats_add.tx_failure_drops = cnt;
> > -    sw_stats_add.tx_retries = MIN(retries, max_retries);
> > -
> > -    rte_spinlock_lock(&dev->stats_lock);
> > -    netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets,
> > -                                         &sw_stats_add);
> > -    rte_spinlock_unlock(&dev->stats_lock);
> > -
> > -out:
> > -    for (i = 0; i < n_packets_to_free; i++) {
> > -        dp_packet_delete(pkts[i]);
> > -    }
> > -}
> > -
> >  static void
> >  netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> >  {
> > @@ -2783,76 +2699,70 @@ dpdk_copy_dp_packet_to_mbuf(struct
> > rte_mempool *mp, struct dp_packet *pkt_orig)
> >      return pkt_dest;
> >  }
> >  
> > -/* Tx function. Transmit packets indefinitely */
> > -static void
> > -dpdk_do_tx_copy(struct netdev *netdev, int qid, struct
> > dp_packet_batch *batch)
> > +/* Replace packets in a 'batch' with their corresponding copies
> > using
> > + * DPDK memory.
> > + *
> > + * Returns the number of good packets in the batch. */
> > +static size_t
> > +dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct
> > dp_packet_batch *batch)
> >      OVS_NO_THREAD_SAFETY_ANALYSIS
> >  {
> > -    const size_t batch_cnt = dp_packet_batch_size(batch);
> > -#if !defined(__CHECKER__) && !defined(_WIN32)
> > -    const size_t PKT_ARRAY_SIZE = batch_cnt;
> > -#else
> > -    /* Sparse or MSVC doesn't like variable length array. */
> > -    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> > -#endif
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> > -    struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> > -    uint32_t cnt = batch_cnt;
> > -    uint32_t dropped = 0;
> > -    uint32_t tx_failure = 0;
> > -    uint32_t mtu_drops = 0;
> > -    uint32_t qos_drops = 0;
> > -
> > -    if (dev->type != DPDK_DEV_VHOST) {
> > -        /* Check if QoS has been configured for this netdev. */
> > -        cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch-
> > >packets,
> > -                                  batch_cnt, false);
> > -        qos_drops = batch_cnt - cnt;
> > -    }
> > -
> > -    uint32_t txcnt = 0;
> > -
> > -    for (uint32_t i = 0; i < cnt; i++) {
> > -        struct dp_packet *packet = batch->packets[i];
> > -        uint32_t size = dp_packet_size(packet);
> > -
> > -        if (size > dev->max_packet_len
> > -            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> > -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> > size,
> > -                         dev->max_packet_len);
> > -            mtu_drops++;
> > -            continue;
> > -        }
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +    struct dp_packet *packet;
> >  
> > -        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp,
> > packet);
> > -        if (OVS_UNLIKELY(!pkts[txcnt])) {
> > -            dropped = cnt - i;
> > -            break;
> > -        }
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) {
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            struct dp_packet *pktcopy;
> >  
> > -        txcnt++;
> > -    }
> > +            pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp,
> > packet);
> > +            if (pktcopy) {
> > +                dp_packet_batch_refill(batch, pktcopy, i);
> > +            }
> >  
> > -    if (OVS_LIKELY(txcnt)) {
> > -        if (dev->type == DPDK_DEV_VHOST) {
> > -            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> > -        } else {
> > -            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> > -                                                   (struct rte_mbuf
> > **)pkts,
> > -                                                   txcnt);
> > +            dp_packet_delete(packet);
> >          }
> >      }
> >  
> > -    dropped += qos_drops + mtu_drops + tx_failure;
> > -    if (OVS_UNLIKELY(dropped)) {
> > -        rte_spinlock_lock(&dev->stats_lock);
> > -        dev->stats.tx_dropped += dropped;
> > -        sw_stats->tx_failure_drops += tx_failure;
> > -        sw_stats->tx_mtu_exceeded_drops += mtu_drops;
> > -        sw_stats->tx_qos_drops += qos_drops;
> > -        rte_spinlock_unlock(&dev->stats_lock);
> > +    return dp_packet_batch_size(batch);
> > +}
> > +
> > +static size_t
> > +netdev_dpdk_common_send(struct netdev *netdev, struct
> > dp_packet_batch *batch,
> > +                        struct netdev_dpdk_sw_stats *stats)
> > +{
> > +    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> > +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > +    size_t cnt, pkt_cnt = dp_packet_batch_size(batch);
> > +
> > +    memset(stats, 0, sizeof *stats);
> > +
> > +    /* Copy dp-packets to mbufs. */
> > +    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> > +        cnt = dpdk_copy_batch_to_mbuf(netdev, batch);
> > +        stats->tx_failure_drops += pkt_cnt - cnt;
> > +        pkt_cnt = cnt;
> >      }
> > +
> > +    /* Drop oversized packets. */
> > +    cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt);
> > +    stats->tx_mtu_exceeded_drops += pkt_cnt - cnt;
> > +    pkt_cnt = cnt;
> > +
> > +    /* Prepare each mbuf for hardware offloading. */
> > +    if (userspace_tso_enabled()) {
> > +        cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt);
> > +        stats->tx_invalid_hwol_drops += pkt_cnt - cnt;
> > +        pkt_cnt = cnt;
> > +    }
> > +
> > +    /* Apply Quality of Service policy. */
> > +    cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true);
> > +    stats->tx_qos_drops += pkt_cnt - cnt;
> > +
> > +    return cnt;
> >  }
> >  
> >  static int
> > @@ -2860,82 +2770,115 @@ netdev_dpdk_vhost_send(struct netdev
> > *netdev, int qid,
> >                         struct dp_packet_batch *batch,
> >                         bool concurrent_txq OVS_UNUSED)
> >  {
> > +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > +    int max_retries = VHOST_ENQ_RETRY_MIN;
> > +    int cnt, batch_cnt, vhost_batch_cnt;
> > +    int vid = netdev_dpdk_get_vid(dev);
> > +    struct netdev_dpdk_sw_stats stats;
> > +    struct rte_mbuf **pkts;
> > +    int retries;
> >  
> > -    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> > -        dpdk_do_tx_copy(netdev, qid, batch);
> > +    batch_cnt = cnt = dp_packet_batch_size(batch);
> > +    qid = dev->tx_q[qid % netdev->n_txq].map;
> > +    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
> > +                     || !(dev->flags & NETDEV_UP))) {
> > +        rte_spinlock_lock(&dev->stats_lock);
> > +        dev->stats.tx_dropped += cnt;
> > +        rte_spinlock_unlock(&dev->stats_lock);
Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2640a421a..a1437db4d 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2585,90 +2585,6 @@  netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
     }
 }
 
-static void
-__netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
-                         struct dp_packet **pkts, int cnt)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
-    struct netdev_dpdk_sw_stats sw_stats_add;
-    unsigned int n_packets_to_free = cnt;
-    unsigned int total_packets = cnt;
-    int i, retries = 0;
-    int max_retries = VHOST_ENQ_RETRY_MIN;
-    int vid = netdev_dpdk_get_vid(dev);
-
-    qid = dev->tx_q[qid % netdev->n_txq].map;
-
-    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
-                     || !(dev->flags & NETDEV_UP))) {
-        rte_spinlock_lock(&dev->stats_lock);
-        dev->stats.tx_dropped+= cnt;
-        rte_spinlock_unlock(&dev->stats_lock);
-        goto out;
-    }
-
-    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
-        COVERAGE_INC(vhost_tx_contention);
-        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
-    }
-
-    sw_stats_add.tx_invalid_hwol_drops = cnt;
-    if (userspace_tso_enabled()) {
-        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
-    }
-
-    sw_stats_add.tx_invalid_hwol_drops -= cnt;
-    sw_stats_add.tx_mtu_exceeded_drops = cnt;
-    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
-    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
-
-    /* Check has QoS has been configured for the netdev */
-    sw_stats_add.tx_qos_drops = cnt;
-    cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
-    sw_stats_add.tx_qos_drops -= cnt;
-
-    n_packets_to_free = cnt;
-
-    do {
-        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
-        unsigned int tx_pkts;
-
-        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts, cnt);
-        if (OVS_LIKELY(tx_pkts)) {
-            /* Packets have been sent.*/
-            cnt -= tx_pkts;
-            /* Prepare for possible retry.*/
-            cur_pkts = &cur_pkts[tx_pkts];
-            if (OVS_UNLIKELY(cnt && !retries)) {
-                /*
-                 * Read max retries as there are packets not sent
-                 * and no retries have already occurred.
-                 */
-                atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries);
-            }
-        } else {
-            /* No packets sent - do not retry.*/
-            break;
-        }
-    } while (cnt && (retries++ < max_retries));
-
-    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
-
-    sw_stats_add.tx_failure_drops = cnt;
-    sw_stats_add.tx_retries = MIN(retries, max_retries);
-
-    rte_spinlock_lock(&dev->stats_lock);
-    netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets,
-                                         &sw_stats_add);
-    rte_spinlock_unlock(&dev->stats_lock);
-
-out:
-    for (i = 0; i < n_packets_to_free; i++) {
-        dp_packet_delete(pkts[i]);
-    }
-}
-
 static void
 netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
 {
@@ -2783,76 +2699,70 @@  dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
     return pkt_dest;
 }
 
-/* Tx function. Transmit packets indefinitely */
-static void
-dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
+/* Replace packets in a 'batch' with their corresponding copies using
+ * DPDK memory.
+ *
+ * Returns the number of good packets in the batch. */
+static size_t
+dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct dp_packet_batch *batch)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
-    const size_t batch_cnt = dp_packet_batch_size(batch);
-#if !defined(__CHECKER__) && !defined(_WIN32)
-    const size_t PKT_ARRAY_SIZE = batch_cnt;
-#else
-    /* Sparse or MSVC doesn't like variable length array. */
-    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
-#endif
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct dp_packet *pkts[PKT_ARRAY_SIZE];
-    struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
-    uint32_t cnt = batch_cnt;
-    uint32_t dropped = 0;
-    uint32_t tx_failure = 0;
-    uint32_t mtu_drops = 0;
-    uint32_t qos_drops = 0;
-
-    if (dev->type != DPDK_DEV_VHOST) {
-        /* Check if QoS has been configured for this netdev. */
-        cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch->packets,
-                                  batch_cnt, false);
-        qos_drops = batch_cnt - cnt;
-    }
-
-    uint32_t txcnt = 0;
-
-    for (uint32_t i = 0; i < cnt; i++) {
-        struct dp_packet *packet = batch->packets[i];
-        uint32_t size = dp_packet_size(packet);
-
-        if (size > dev->max_packet_len
-            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
-            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
-                         dev->max_packet_len);
-            mtu_drops++;
-            continue;
-        }
+    size_t i, size = dp_packet_batch_size(batch);
+    struct dp_packet *packet;
 
-        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
-        if (OVS_UNLIKELY(!pkts[txcnt])) {
-            dropped = cnt - i;
-            break;
-        }
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) {
+            dp_packet_batch_refill(batch, packet, i);
+        } else {
+            struct dp_packet *pktcopy;
 
-        txcnt++;
-    }
+            pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
+            if (pktcopy) {
+                dp_packet_batch_refill(batch, pktcopy, i);
+            }
 
-    if (OVS_LIKELY(txcnt)) {
-        if (dev->type == DPDK_DEV_VHOST) {
-            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
-        } else {
-            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
-                                                   (struct rte_mbuf **)pkts,
-                                                   txcnt);
+            dp_packet_delete(packet);
         }
     }
 
-    dropped += qos_drops + mtu_drops + tx_failure;
-    if (OVS_UNLIKELY(dropped)) {
-        rte_spinlock_lock(&dev->stats_lock);
-        dev->stats.tx_dropped += dropped;
-        sw_stats->tx_failure_drops += tx_failure;
-        sw_stats->tx_mtu_exceeded_drops += mtu_drops;
-        sw_stats->tx_qos_drops += qos_drops;
-        rte_spinlock_unlock(&dev->stats_lock);
+    return dp_packet_batch_size(batch);
+}
+
+static size_t
+netdev_dpdk_common_send(struct netdev *netdev, struct dp_packet_batch *batch,
+                        struct netdev_dpdk_sw_stats *stats)
+{
+    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    size_t cnt, pkt_cnt = dp_packet_batch_size(batch);
+
+    memset(stats, 0, sizeof *stats);
+
+    /* Copy dp-packets to mbufs. */
+    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
+        cnt = dpdk_copy_batch_to_mbuf(netdev, batch);
+        stats->tx_failure_drops += pkt_cnt - cnt;
+        pkt_cnt = cnt;
     }
+
+    /* Drop oversized packets. */
+    cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt);
+    stats->tx_mtu_exceeded_drops += pkt_cnt - cnt;
+    pkt_cnt = cnt;
+
+    /* Prepare each mbuf for hardware offloading. */
+    if (userspace_tso_enabled()) {
+        cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt);
+        stats->tx_invalid_hwol_drops += pkt_cnt - cnt;
+        pkt_cnt = cnt;
+    }
+
+    /* Apply Quality of Service policy. */
+    cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true);
+    stats->tx_qos_drops += pkt_cnt - cnt;
+
+    return cnt;
 }
 
 static int
@@ -2860,82 +2770,115 @@  netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
                        struct dp_packet_batch *batch,
                        bool concurrent_txq OVS_UNUSED)
 {
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int max_retries = VHOST_ENQ_RETRY_MIN;
+    int cnt, batch_cnt, vhost_batch_cnt;
+    int vid = netdev_dpdk_get_vid(dev);
+    struct netdev_dpdk_sw_stats stats;
+    struct rte_mbuf **pkts;
+    int retries;
 
-    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
-        dpdk_do_tx_copy(netdev, qid, batch);
+    batch_cnt = cnt = dp_packet_batch_size(batch);
+    qid = dev->tx_q[qid % netdev->n_txq].map;
+    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
+                     || !(dev->flags & NETDEV_UP))) {
+        rte_spinlock_lock(&dev->stats_lock);
+        dev->stats.tx_dropped += cnt;
+        rte_spinlock_unlock(&dev->stats_lock);
         dp_packet_delete_batch(batch, true);
-    } else {
-        __netdev_dpdk_vhost_send(netdev, qid, batch->packets,
-                                 dp_packet_batch_size(batch));
+        return 0;
+    }
+
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+
+    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
+        COVERAGE_INC(vhost_tx_contention);
+        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
+
+    pkts = (struct rte_mbuf **) batch->packets;
+    vhost_batch_cnt = cnt;
+    retries = 0;
+    do {
+        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
+        int tx_pkts;
+
+        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts, cnt);
+        if (OVS_LIKELY(tx_pkts)) {
+            /* Packets have been sent.*/
+            cnt -= tx_pkts;
+            /* Prepare for possible retry.*/
+            pkts = &pkts[tx_pkts];
+            if (OVS_UNLIKELY(cnt && !retries)) {
+                /*
+                 * Read max retries as there are packets not sent
+                 * and no retries have already occurred.
+                 */
+                atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries);
+            }
+        } else {
+            /* No packets sent - do not retry.*/
+            break;
+        }
+    } while (cnt && (retries++ < max_retries));
+
+    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
+
+    stats.tx_failure_drops += cnt;
+    stats.tx_retries = MIN(retries, max_retries);
+
+    rte_spinlock_lock(&dev->stats_lock);
+    netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, batch_cnt,
+                                         &stats);
+    rte_spinlock_unlock(&dev->stats_lock);
+
+    pkts = (struct rte_mbuf **) batch->packets;
+    for (int i = 0; i < vhost_batch_cnt; i++) {
+        rte_pktmbuf_free(pkts[i]);
+    }
+
     return 0;
 }
 
-static inline void
-netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
-                   struct dp_packet_batch *batch,
-                   bool concurrent_txq)
+static int
+netdev_dpdk_eth_send(struct netdev *netdev, int qid,
+                     struct dp_packet_batch *batch, bool concurrent_txq)
 {
+    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int batch_cnt = dp_packet_batch_size(batch);
+    struct netdev_dpdk_sw_stats stats;
+    int cnt, dropped;
+
     if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
         dp_packet_delete_batch(batch, true);
-        return;
+        return 0;
     }
 
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+    dropped = batch_cnt - cnt;
     if (OVS_UNLIKELY(concurrent_txq)) {
         qid = qid % dev->up.n_txq;
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
-    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
-        struct netdev *netdev = &dev->up;
-
-        dpdk_do_tx_copy(netdev, qid, batch);
-        dp_packet_delete_batch(batch, true);
-    } else {
+    dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
+    if (OVS_UNLIKELY(dropped)) {
         struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
-        int dropped;
-        int tx_failure, mtu_drops, qos_drops, hwol_drops;
-        int batch_cnt = dp_packet_batch_size(batch);
-        struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
-        hwol_drops = batch_cnt;
-        if (userspace_tso_enabled()) {
-            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
-        }
-        hwol_drops -= batch_cnt;
-        mtu_drops = batch_cnt;
-        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
-        mtu_drops -= batch_cnt;
-        qos_drops = batch_cnt;
-        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
-        qos_drops -= batch_cnt;
-
-        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt);
-
-        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
-        if (OVS_UNLIKELY(dropped)) {
-            rte_spinlock_lock(&dev->stats_lock);
-            dev->stats.tx_dropped += dropped;
-            sw_stats->tx_failure_drops += tx_failure;
-            sw_stats->tx_mtu_exceeded_drops += mtu_drops;
-            sw_stats->tx_qos_drops += qos_drops;
-            sw_stats->tx_invalid_hwol_drops += hwol_drops;
-            rte_spinlock_unlock(&dev->stats_lock);
-        }
+        rte_spinlock_lock(&dev->stats_lock);
+        dev->stats.tx_dropped += dropped;
+        sw_stats->tx_failure_drops += stats.tx_failure_drops;
+        sw_stats->tx_mtu_exceeded_drops += stats.tx_mtu_exceeded_drops;
+        sw_stats->tx_qos_drops += stats.tx_qos_drops;
+        sw_stats->tx_invalid_hwol_drops += stats.tx_invalid_hwol_drops;
+        rte_spinlock_unlock(&dev->stats_lock);
     }
 
     if (OVS_UNLIKELY(concurrent_txq)) {
         rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
     }
-}
-
-static int
-netdev_dpdk_eth_send(struct netdev *netdev, int qid,
-                     struct dp_packet_batch *batch, bool concurrent_txq)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
-    netdev_dpdk_send__(dev, qid, batch, concurrent_txq);
     return 0;
 }
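
For readers skimming the diff, the core of the refactor is the staged drop
accounting in netdev_dpdk_common_send(): each preparation stage compacts the
packet array, returns the number of surviving packets, and the caller charges
the difference to a per-stage counter. The standalone sketch below illustrates
only that pattern; filter_len() and qos_run() are hypothetical stand-ins for
netdev_dpdk_filter_packet_len() and netdev_dpdk_qos_run(), and none of it is
OVS code.

/* Standalone sketch (not OVS code) of the staged drop-accounting pattern
 * used by netdev_dpdk_common_send() in the patch above: each stage keeps
 * the surviving packets at the front of the array and returns how many
 * survived; the caller attributes the difference to a per-stage counter. */
#include <stdio.h>
#include <stddef.h>

struct sw_stats {
    size_t tx_mtu_exceeded_drops;
    size_t tx_qos_drops;
};

/* Hypothetical stand-in for netdev_dpdk_filter_packet_len():
 * drop packets larger than 'max_len'. */
static size_t
filter_len(int *pkt_len, size_t cnt, int max_len)
{
    size_t kept = 0;

    for (size_t i = 0; i < cnt; i++) {
        if (pkt_len[i] <= max_len) {
            pkt_len[kept++] = pkt_len[i];
        }
    }
    return kept;
}

/* Hypothetical stand-in for netdev_dpdk_qos_run():
 * admit packets until a toy byte budget is exhausted. */
static size_t
qos_run(int *pkt_len, size_t cnt, int budget)
{
    size_t kept = 0;

    for (size_t i = 0; i < cnt; i++) {
        if (pkt_len[i] <= budget) {
            budget -= pkt_len[i];
            pkt_len[kept++] = pkt_len[i];
        }
    }
    return kept;
}

static size_t
common_send(int *pkt_len, size_t pkt_cnt, struct sw_stats *stats)
{
    size_t cnt;

    /* Drop oversized packets. */
    cnt = filter_len(pkt_len, pkt_cnt, 1500);
    stats->tx_mtu_exceeded_drops += pkt_cnt - cnt;
    pkt_cnt = cnt;

    /* Apply the (toy) QoS policy. */
    cnt = qos_run(pkt_len, pkt_cnt, 1600);
    stats->tx_qos_drops += pkt_cnt - cnt;

    return cnt;    /* Packets still eligible for transmission. */
}

int
main(void)
{
    int pkt_len[] = { 64, 9000, 1500, 2048, 512 };
    struct sw_stats stats = { 0, 0 };
    size_t cnt;

    cnt = common_send(pkt_len, 5, &stats);
    printf("to send: %zu, mtu drops: %zu, qos drops: %zu\n",
           cnt, stats.tx_mtu_exceeded_drops, stats.tx_qos_drops);
    return 0;
}

In the patch itself the same pattern also covers the copy of non-DPDK packets
into mbufs and, when userspace TSO is enabled, the hardware-offload
preparation; netdev_dpdk_vhost_send() and netdev_dpdk_eth_send() then fold the
returned per-stage counters into dev->sw_stats under stats_lock.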