diff mbox series

[ovs-dev,v2] netdev-dpdk: Refactor the DPDK transmit path.

Message ID 20210110030505.325722-1-fbl@sysclose.org
State New
Headers show
Series [ovs-dev,v2] netdev-dpdk: Refactor the DPDK transmit path. | expand

Commit Message

Flavio Leitner Jan. 10, 2021, 3:05 a.m. UTC
This patch split out the common code between vhost and
dpdk transmit paths to shared functions to simplify the
code and fix an issue.

The issue is that the packet coming from non-DPDK device
and egressing on a DPDK device currently skips the hwol
preparation.

This also have the side effect of leaving only the dpdk
transmit code under the txq lock.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---

V2:
   - mentioned the tx lock change in the commit message.
   - fixed packet leak when copy fails.
   - moved pkt_cnt = cnt two lines up.

I tested the following scenarios with iperf and iperf -R
and noticed no performance regression:
          IPERF       VM-Bridge  1.02x
          IPERF           VM-VM  1.04x
          IPERF      VM-ExtHost  1.00x
          IPERF           VM-NS  1.01x
          IPERF  VM-VLAN-Bridge  1.03x
          IPERF      VM-VLAN-VM  1.03x
          IPERF VM-VLAN-ExtHost  1.01x
          IPERF      VM-VLAN-NS  1.02x
          IPERF        VM-V6-VM  1.03x
          IPERF   VM-V6-ExtHost  1.00x
          IPERF        VM-V6-NS  1.01x
        IPERF-R       VM-Bridge  1.01x
        IPERF-R           VM-VM  1.04x
        IPERF-R      VM-ExtHost  1.10x
        IPERF-R           VM-NS  1.01x
        IPERF-R  VM-VLAN-Bridge  1.03x
        IPERF-R      VM-VLAN-VM  1.02x
        IPERF-R VM-VLAN-ExtHost  1.08x
        IPERF-R      VM-VLAN-NS  1.02x
        IPERF-R        VM-V6-VM  1.00x
        IPERF-R   VM-V6-ExtHost  1.11x
        IPERF-R        VM-V6-NS  1.00x

Now using trex, 64byte packet, PVP:
Original: 3.6Mpps
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  avg cycles per packet: 304.92 (8331383020/27323150)
  avg processing cycles per packet: 304.92 (8331383020/27323150)

Patched: 3.6Mpps
  avg. packets per output batch: 32.00
  idle cycles: 0 (0.00%)
  avg cycles per packet: 304.08 (21875784116/71941516)
  avg processing cycles per packet: 304.08 (21875784116/71941516)

 lib/netdev-dpdk.c | 335 +++++++++++++++++++---------------------------
 1 file changed, 139 insertions(+), 196 deletions(-)

Comments

Pai G, Sunil Jan. 11, 2021, 3:49 p.m. UTC | #1
Hi Flavio,


> -----Original Message-----
> From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> Sent: Sunday, January 10, 2021 8:35 AM
> To: dev@openvswitch.org
> Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner
> <fbl@sysclose.org>
> Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit
> path.
> 
> This patch split out the common code between vhost and dpdk transmit
> paths to shared functions to simplify the code and fix an issue.
> 
> The issue is that the packet coming from non-DPDK device and egressing on a
> DPDK device currently skips the hwol preparation.
> 
> This also have the side effect of leaving only the dpdk transmit code under
> the txq lock.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
> 
> V2:
>    - mentioned the tx lock change in the commit message.
>    - fixed packet leak when copy fails.
>    - moved pkt_cnt = cnt two lines up.
> 
> I tested the following scenarios with iperf and iperf -R and noticed no
> performance regression:
>           IPERF       VM-Bridge  1.02x
>           IPERF           VM-VM  1.04x
>           IPERF      VM-ExtHost  1.00x
>           IPERF           VM-NS  1.01x
>           IPERF  VM-VLAN-Bridge  1.03x
>           IPERF      VM-VLAN-VM  1.03x
>           IPERF VM-VLAN-ExtHost  1.01x
>           IPERF      VM-VLAN-NS  1.02x
>           IPERF        VM-V6-VM  1.03x
>           IPERF   VM-V6-ExtHost  1.00x
>           IPERF        VM-V6-NS  1.01x
>         IPERF-R       VM-Bridge  1.01x
>         IPERF-R           VM-VM  1.04x
>         IPERF-R      VM-ExtHost  1.10x
>         IPERF-R           VM-NS  1.01x
>         IPERF-R  VM-VLAN-Bridge  1.03x
>         IPERF-R      VM-VLAN-VM  1.02x
>         IPERF-R VM-VLAN-ExtHost  1.08x
>         IPERF-R      VM-VLAN-NS  1.02x
>         IPERF-R        VM-V6-VM  1.00x
>         IPERF-R   VM-V6-ExtHost  1.11x
>         IPERF-R        VM-V6-NS  1.00x
> 
> Now using trex, 64byte packet, PVP:
> Original: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.92 (8331383020/27323150)
>   avg processing cycles per packet: 304.92 (8331383020/27323150)
> 
> Patched: 3.6Mpps
>   avg. packets per output batch: 32.00
>   idle cycles: 0 (0.00%)
>   avg cycles per packet: 304.08 (21875784116/71941516)
>   avg processing cycles per packet: 304.08 (21875784116/71941516)


Ran the following tests in VSPERF for QoS and they all pass:
ovsdpdk_qos_create_phy_port	
ovsdpdk_qos_delete_phy_port	
ovsdpdk_qos_create_vport	
ovsdpdk_qos_delete_vport	
ovsdpdk_qos_create_no_cir	
ovsdpdk_qos_create_no_cbs

For performance tests:
We see a slight dip in performance for physical ports ~1% in phy2phy_tput test in VSPERF with the patch(v1) for 64byte packet
and PVP test results are in line with your observations.
NIC: Fortville 10G X710


Performance with different traffic profiles when deployed with 32VM's +1M flows + vxlan enabled :
With burst mode : ~1% increase in performance
as compared to 
scatter mode: ~4% decrease in performance.

[snipped]

Thanks , 
Sunil
Intel

> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Flavio Leitner Jan. 11, 2021, 7:36 p.m. UTC | #2
Hi Sunil,

On Mon, Jan 11, 2021 at 03:49:56PM +0000, Pai G, Sunil wrote:
> Hi Flavio,
> 
> 
> > -----Original Message-----
> > From: dev <ovs-dev-bounces@openvswitch.org> On Behalf Of Flavio Leitner
> > Sent: Sunday, January 10, 2021 8:35 AM
> > To: dev@openvswitch.org
> > Cc: David Marchand <david.marchand@redhat.com>; Flavio Leitner
> > <fbl@sysclose.org>
> > Subject: [ovs-dev] [PATCH v2] netdev-dpdk: Refactor the DPDK transmit
> > path.
> > 
> > This patch split out the common code between vhost and dpdk transmit
> > paths to shared functions to simplify the code and fix an issue.
> > 
> > The issue is that the packet coming from non-DPDK device and egressing on a
> > DPDK device currently skips the hwol preparation.
> > 
> > This also have the side effect of leaving only the dpdk transmit code under
> > the txq lock.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> > 
> > V2:
> >    - mentioned the tx lock change in the commit message.
> >    - fixed packet leak when copy fails.
> >    - moved pkt_cnt = cnt two lines up.
> > 
> > I tested the following scenarios with iperf and iperf -R and noticed no
> > performance regression:
> >           IPERF       VM-Bridge  1.02x
> >           IPERF           VM-VM  1.04x
> >           IPERF      VM-ExtHost  1.00x
> >           IPERF           VM-NS  1.01x
> >           IPERF  VM-VLAN-Bridge  1.03x
> >           IPERF      VM-VLAN-VM  1.03x
> >           IPERF VM-VLAN-ExtHost  1.01x
> >           IPERF      VM-VLAN-NS  1.02x
> >           IPERF        VM-V6-VM  1.03x
> >           IPERF   VM-V6-ExtHost  1.00x
> >           IPERF        VM-V6-NS  1.01x
> >         IPERF-R       VM-Bridge  1.01x
> >         IPERF-R           VM-VM  1.04x
> >         IPERF-R      VM-ExtHost  1.10x
> >         IPERF-R           VM-NS  1.01x
> >         IPERF-R  VM-VLAN-Bridge  1.03x
> >         IPERF-R      VM-VLAN-VM  1.02x
> >         IPERF-R VM-VLAN-ExtHost  1.08x
> >         IPERF-R      VM-VLAN-NS  1.02x
> >         IPERF-R        VM-V6-VM  1.00x
> >         IPERF-R   VM-V6-ExtHost  1.11x
> >         IPERF-R        VM-V6-NS  1.00x
> > 
> > Now using trex, 64byte packet, PVP:
> > Original: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.92 (8331383020/27323150)
> >   avg processing cycles per packet: 304.92 (8331383020/27323150)
> > 
> > Patched: 3.6Mpps
> >   avg. packets per output batch: 32.00
> >   idle cycles: 0 (0.00%)
> >   avg cycles per packet: 304.08 (21875784116/71941516)
> >   avg processing cycles per packet: 304.08 (21875784116/71941516)
> 
> 
> Ran the following tests in VSPERF for QoS and they all pass:

Great, thanks for testing the patch.

> ovsdpdk_qos_create_phy_port	
> ovsdpdk_qos_delete_phy_port	
> ovsdpdk_qos_create_vport	
> ovsdpdk_qos_delete_vport	
> ovsdpdk_qos_create_no_cir	
> ovsdpdk_qos_create_no_cbs
> 
> For performance tests:
> We see a slight dip in performance for physical ports ~1% in phy2phy_tput test in VSPERF with the patch(v1) for 64byte packet

I tried P2P here (v2) with trex, single flow, single queue,
single PMD and I got this:

original: 12.88Mpps/6.6Gbps
  idle cycles: 0 (0.00%)
  processing cycles: 14466174204 (100.00%)
  avg cycles per packet: 170.73 (14466174204/84730796)
  avg processing cycles per packet: 170.73 (14466174204/84730796)


patched: 13.00Mpps/6.65Mpps
  idle cycles: 0 (0.00%)
  processing cycles: 3783269716 (100.00%)
  avg cycles per packet: 169.49 (3783269716/22321968)
  avg processing cycles per packet: 169.49 (3783269716/22321968)

So, I see a small performance gain. Can you provide a more detailed
test description? I wonder if we are testing the same thing.

> and PVP test results are in line with your observations.
> NIC: Fortville 10G X710

Great.

> Performance with different traffic profiles when deployed with 32VM's +1M flows + vxlan enabled :
> With burst mode : ~1% increase in performance

That is aligned with above.

> as compared to 
> scatter mode: ~4% decrease in performance.

Hm, that is unexpected. Is this reproducible?
How are you switching between burst and scatter mode?

Thanks,
fbl
David Marchand Jan. 12, 2021, 7:57 a.m. UTC | #3
On Sun, Jan 10, 2021 at 4:05 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
> This patch split out the common code between vhost and
> dpdk transmit paths to shared functions to simplify the
> code and fix an issue.
>
> The issue is that the packet coming from non-DPDK device
> and egressing on a DPDK device currently skips the hwol
> preparation.
>
> This also have the side effect of leaving only the dpdk

nit: has*

> transmit code under the txq lock.
>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Reviewed-by: David Marchand <david.marchand@redhat.com>

Thanks.
Kovacevic, Marko Jan. 13, 2021, 10:51 a.m. UTC | #4
Hi Flavio,

> So, I see a small performance gain. Can you provide a more detailed
> test description? I wonder if we are testing the same thing.

VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic.

> 
> > and PVP test results are in line with your observations.
> > NIC: Fortville 10G X710
> 
> Great.
> 
> > Performance with different traffic profiles when deployed with 32VM's
> +1M flows + vxlan enabled :
> > With burst mode : ~1% increase in performance
> 
> That is aligned with above.
> 
> > as compared to
> > scatter mode: ~4% decrease in performance.
> 
> Hm, that is unexpected. Is this reproducible?

If you mean are the results reproducible then yes, 
I done a re-run of v1 and v2 again 

I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results
DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page.

V1:
DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path.
7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page.

V1: Burst= 1%
V1: Scatter -4%
As sunil reported already shows the same results again

As for V2 it didn't show any increase or decrease for both burst and scatter, 
And even between the v1 and v2 results from my second run it didn't show much difference on my test anyway.

> How are you switching between burst and scatter mode?

As for how were switching between burst and scatter, I just use two different traffic profiles & gather results.
So I run the test using burst profile restart the test then use scatter.

Traffic @ Phy NIC Rx:
Ether()/IP()/UDP()/VXLAN()/Ether()/IP()

Burst: on the outer ip we do a burst of 32 packets with same ip then switch for next 32 and so on 
Scatter: for scatter we do incrementally for 32 

And we do this for 1M flows

I hope this answers your question
> 
> Thanks,
> fbl

Thanks,
Marko K
Flavio Leitner Jan. 13, 2021, 6:01 p.m. UTC | #5
Hi Marko,

On Wed, Jan 13, 2021 at 10:51:03AM +0000, Kovacevic, Marko wrote:
> Hi Flavio,
> 
> > So, I see a small performance gain. Can you provide a more detailed
> > test description? I wonder if we are testing the same thing.
> 
> VSPERF phy2phy_tput uses 2 dpdk pmds and performs zero packet loss test(RFC2544) with bidirectional traffic.

Ok.

> > > and PVP test results are in line with your observations.
> > > NIC: Fortville 10G X710
> > 
> > Great.
> > 
> > > Performance with different traffic profiles when deployed with 32VM's
> > +1M flows + vxlan enabled :
> > > With burst mode : ~1% increase in performance
> > 
> > That is aligned with above.
> > 
> > > as compared to
> > > scatter mode: ~4% decrease in performance.
> > 
> > Hm, that is unexpected. Is this reproducible?
> 
> If you mean are the results reproducible then yes, 
> I done a re-run of v1 and v2 again 

Thanks for checking.

> I ran my test against these commits to get baseline results and then ran v1 and v2 and compared results
> DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
> OVS: 7f79ae2fb (HEAD -> master, origin/master, origin/HEAD) Documentation: Simplify the website main page.
> 
> V1:
> DPDK: b1d36cf82 (HEAD, tag: v20.11, origin/releases) version: 20.11.0
> OVS: 957132b3a (HEAD -> master) netdev-dpdk: Refactor the DPDK transmit path.
> 7f79ae2fb (origin/master, origin/HEAD) Documentation: Simplify the website main page.
> 
> V1: Burst= 1%
> V1: Scatter -4%
> As sunil reported already shows the same results again

Alright. I will try to reproduce in my lab to understand the root
cause.


> As for V2 it didn't show any increase or decrease for both burst and scatter, 
> And even between the v1 and v2 results from my second run it didn't show much difference on my test anyway.
> 
> > How are you switching between burst and scatter mode?
> 
> As for how were switching between burst and scatter, I just use two different traffic profiles & gather results.
> So I run the test using burst profile restart the test then use scatter.
> 
> Traffic @ Phy NIC Rx:
> Ether()/IP()/UDP()/VXLAN()/Ether()/IP()
> 
> Burst: on the outer ip we do a burst of 32 packets with same ip then switch for next 32 and so on 
> Scatter: for scatter we do incrementally for 32 

Glad that I asked because I thought you were talking about the
NIC mode with scatter enabled or not.

> And we do this for 1M flows
> 
> I hope this answers your question

They are helpful. I will try to reproduce the -4% results in my
lab as a next step. It may take some time.

Thanks for testing and providing feedbacks!
diff mbox series

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2640a421a..a1437db4d 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2585,90 +2585,6 @@  netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
     }
 }
 
-static void
-__netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
-                         struct dp_packet **pkts, int cnt)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
-    struct netdev_dpdk_sw_stats sw_stats_add;
-    unsigned int n_packets_to_free = cnt;
-    unsigned int total_packets = cnt;
-    int i, retries = 0;
-    int max_retries = VHOST_ENQ_RETRY_MIN;
-    int vid = netdev_dpdk_get_vid(dev);
-
-    qid = dev->tx_q[qid % netdev->n_txq].map;
-
-    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
-                     || !(dev->flags & NETDEV_UP))) {
-        rte_spinlock_lock(&dev->stats_lock);
-        dev->stats.tx_dropped+= cnt;
-        rte_spinlock_unlock(&dev->stats_lock);
-        goto out;
-    }
-
-    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
-        COVERAGE_INC(vhost_tx_contention);
-        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
-    }
-
-    sw_stats_add.tx_invalid_hwol_drops = cnt;
-    if (userspace_tso_enabled()) {
-        cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
-    }
-
-    sw_stats_add.tx_invalid_hwol_drops -= cnt;
-    sw_stats_add.tx_mtu_exceeded_drops = cnt;
-    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
-    sw_stats_add.tx_mtu_exceeded_drops -= cnt;
-
-    /* Check has QoS has been configured for the netdev */
-    sw_stats_add.tx_qos_drops = cnt;
-    cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt, true);
-    sw_stats_add.tx_qos_drops -= cnt;
-
-    n_packets_to_free = cnt;
-
-    do {
-        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
-        unsigned int tx_pkts;
-
-        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, cur_pkts, cnt);
-        if (OVS_LIKELY(tx_pkts)) {
-            /* Packets have been sent.*/
-            cnt -= tx_pkts;
-            /* Prepare for possible retry.*/
-            cur_pkts = &cur_pkts[tx_pkts];
-            if (OVS_UNLIKELY(cnt && !retries)) {
-                /*
-                 * Read max retries as there are packets not sent
-                 * and no retries have already occurred.
-                 */
-                atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries);
-            }
-        } else {
-            /* No packets sent - do not retry.*/
-            break;
-        }
-    } while (cnt && (retries++ < max_retries));
-
-    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
-
-    sw_stats_add.tx_failure_drops = cnt;
-    sw_stats_add.tx_retries = MIN(retries, max_retries);
-
-    rte_spinlock_lock(&dev->stats_lock);
-    netdev_dpdk_vhost_update_tx_counters(dev, pkts, total_packets,
-                                         &sw_stats_add);
-    rte_spinlock_unlock(&dev->stats_lock);
-
-out:
-    for (i = 0; i < n_packets_to_free; i++) {
-        dp_packet_delete(pkts[i]);
-    }
-}
-
 static void
 netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
 {
@@ -2783,76 +2699,70 @@  dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
     return pkt_dest;
 }
 
-/* Tx function. Transmit packets indefinitely */
-static void
-dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
+/* Replace packets in a 'batch' with their corresponding copies using
+ * DPDK memory.
+ *
+ * Returns the number of good packets in the batch. */
+static size_t
+dpdk_copy_batch_to_mbuf(struct netdev *netdev, struct dp_packet_batch *batch)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
-    const size_t batch_cnt = dp_packet_batch_size(batch);
-#if !defined(__CHECKER__) && !defined(_WIN32)
-    const size_t PKT_ARRAY_SIZE = batch_cnt;
-#else
-    /* Sparse or MSVC doesn't like variable length array. */
-    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
-#endif
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    struct dp_packet *pkts[PKT_ARRAY_SIZE];
-    struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
-    uint32_t cnt = batch_cnt;
-    uint32_t dropped = 0;
-    uint32_t tx_failure = 0;
-    uint32_t mtu_drops = 0;
-    uint32_t qos_drops = 0;
-
-    if (dev->type != DPDK_DEV_VHOST) {
-        /* Check if QoS has been configured for this netdev. */
-        cnt = netdev_dpdk_qos_run(dev, (struct rte_mbuf **) batch->packets,
-                                  batch_cnt, false);
-        qos_drops = batch_cnt - cnt;
-    }
-
-    uint32_t txcnt = 0;
-
-    for (uint32_t i = 0; i < cnt; i++) {
-        struct dp_packet *packet = batch->packets[i];
-        uint32_t size = dp_packet_size(packet);
-
-        if (size > dev->max_packet_len
-            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
-            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
-                         dev->max_packet_len);
-            mtu_drops++;
-            continue;
-        }
+    size_t i, size = dp_packet_batch_size(batch);
+    struct dp_packet *packet;
 
-        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
-        if (OVS_UNLIKELY(!pkts[txcnt])) {
-            dropped = cnt - i;
-            break;
-        }
+    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
+        if (OVS_UNLIKELY(packet->source == DPBUF_DPDK)) {
+            dp_packet_batch_refill(batch, packet, i);
+        } else {
+            struct dp_packet *pktcopy;
 
-        txcnt++;
-    }
+            pktcopy = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
+            if (pktcopy) {
+                dp_packet_batch_refill(batch, pktcopy, i);
+            }
 
-    if (OVS_LIKELY(txcnt)) {
-        if (dev->type == DPDK_DEV_VHOST) {
-            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
-        } else {
-            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
-                                                   (struct rte_mbuf **)pkts,
-                                                   txcnt);
+            dp_packet_delete(packet);
         }
     }
 
-    dropped += qos_drops + mtu_drops + tx_failure;
-    if (OVS_UNLIKELY(dropped)) {
-        rte_spinlock_lock(&dev->stats_lock);
-        dev->stats.tx_dropped += dropped;
-        sw_stats->tx_failure_drops += tx_failure;
-        sw_stats->tx_mtu_exceeded_drops += mtu_drops;
-        sw_stats->tx_qos_drops += qos_drops;
-        rte_spinlock_unlock(&dev->stats_lock);
+    return dp_packet_batch_size(batch);
+}
+
+static size_t
+netdev_dpdk_common_send(struct netdev *netdev, struct dp_packet_batch *batch,
+                        struct netdev_dpdk_sw_stats *stats)
+{
+    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    size_t cnt, pkt_cnt = dp_packet_batch_size(batch);
+
+    memset(stats, 0, sizeof *stats);
+
+    /* Copy dp-packets to mbufs. */
+    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
+        cnt = dpdk_copy_batch_to_mbuf(netdev, batch);
+        stats->tx_failure_drops += pkt_cnt - cnt;
+        pkt_cnt = cnt;
     }
+
+    /* Drop oversized packets. */
+    cnt = netdev_dpdk_filter_packet_len(dev, pkts, pkt_cnt);
+    stats->tx_mtu_exceeded_drops += pkt_cnt - cnt;
+    pkt_cnt = cnt;
+
+    /* Prepare each mbuf for hardware offloading. */
+    if (userspace_tso_enabled()) {
+        cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, pkt_cnt);
+        stats->tx_invalid_hwol_drops += pkt_cnt - cnt;
+        pkt_cnt = cnt;
+    }
+
+    /* Apply Quality of Service policy. */
+    cnt = netdev_dpdk_qos_run(dev, pkts, pkt_cnt, true);
+    stats->tx_qos_drops += pkt_cnt - cnt;
+
+    return cnt;
 }
 
 static int
@@ -2860,82 +2770,115 @@  netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
                        struct dp_packet_batch *batch,
                        bool concurrent_txq OVS_UNUSED)
 {
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int max_retries = VHOST_ENQ_RETRY_MIN;
+    int cnt, batch_cnt, vhost_batch_cnt;
+    int vid = netdev_dpdk_get_vid(dev);
+    struct netdev_dpdk_sw_stats stats;
+    struct rte_mbuf **pkts;
+    int retries;
 
-    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
-        dpdk_do_tx_copy(netdev, qid, batch);
+    batch_cnt = cnt = dp_packet_batch_size(batch);
+    qid = dev->tx_q[qid % netdev->n_txq].map;
+    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
+                     || !(dev->flags & NETDEV_UP))) {
+        rte_spinlock_lock(&dev->stats_lock);
+        dev->stats.tx_dropped += cnt;
+        rte_spinlock_unlock(&dev->stats_lock);
         dp_packet_delete_batch(batch, true);
-    } else {
-        __netdev_dpdk_vhost_send(netdev, qid, batch->packets,
-                                 dp_packet_batch_size(batch));
+        return 0;
+    }
+
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+
+    if (OVS_UNLIKELY(!rte_spinlock_trylock(&dev->tx_q[qid].tx_lock))) {
+        COVERAGE_INC(vhost_tx_contention);
+        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
+
+    pkts = (struct rte_mbuf **) batch->packets;
+    vhost_batch_cnt = cnt;
+    retries = 0;
+    do {
+        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
+        int tx_pkts;
+
+        tx_pkts = rte_vhost_enqueue_burst(vid, vhost_qid, pkts, cnt);
+        if (OVS_LIKELY(tx_pkts)) {
+            /* Packets have been sent.*/
+            cnt -= tx_pkts;
+            /* Prepare for possible retry.*/
+            pkts = &pkts[tx_pkts];
+            if (OVS_UNLIKELY(cnt && !retries)) {
+                /*
+                 * Read max retries as there are packets not sent
+                 * and no retries have already occurred.
+                 */
+                atomic_read_relaxed(&dev->vhost_tx_retries_max, &max_retries);
+            }
+        } else {
+            /* No packets sent - do not retry.*/
+            break;
+        }
+    } while (cnt && (retries++ < max_retries));
+
+    rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
+
+    stats.tx_failure_drops += cnt;
+    stats.tx_retries = MIN(retries, max_retries);
+
+    rte_spinlock_lock(&dev->stats_lock);
+    netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, batch_cnt,
+                                         &stats);
+    rte_spinlock_unlock(&dev->stats_lock);
+
+    pkts = (struct rte_mbuf **) batch->packets;
+    for (int i = 0; i < vhost_batch_cnt; i++) {
+        rte_pktmbuf_free(pkts[i]);
+    }
+
     return 0;
 }
 
-static inline void
-netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
-                   struct dp_packet_batch *batch,
-                   bool concurrent_txq)
+static int
+netdev_dpdk_eth_send(struct netdev *netdev, int qid,
+                     struct dp_packet_batch *batch, bool concurrent_txq)
 {
+    struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int batch_cnt = dp_packet_batch_size(batch);
+    struct netdev_dpdk_sw_stats stats;
+    int cnt, dropped;
+
     if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
         dp_packet_delete_batch(batch, true);
-        return;
+        return 0;
     }
 
+    cnt = netdev_dpdk_common_send(netdev, batch, &stats);
+    dropped = batch_cnt - cnt;
     if (OVS_UNLIKELY(concurrent_txq)) {
         qid = qid % dev->up.n_txq;
         rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
     }
 
-    if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
-        struct netdev *netdev = &dev->up;
-
-        dpdk_do_tx_copy(netdev, qid, batch);
-        dp_packet_delete_batch(batch, true);
-    } else {
+    dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
+    if (OVS_UNLIKELY(dropped)) {
         struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
-        int dropped;
-        int tx_failure, mtu_drops, qos_drops, hwol_drops;
-        int batch_cnt = dp_packet_batch_size(batch);
-        struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
 
-        hwol_drops = batch_cnt;
-        if (userspace_tso_enabled()) {
-            batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
-        }
-        hwol_drops -= batch_cnt;
-        mtu_drops = batch_cnt;
-        batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
-        mtu_drops -= batch_cnt;
-        qos_drops = batch_cnt;
-        batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
-        qos_drops -= batch_cnt;
-
-        tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt);
-
-        dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
-        if (OVS_UNLIKELY(dropped)) {
-            rte_spinlock_lock(&dev->stats_lock);
-            dev->stats.tx_dropped += dropped;
-            sw_stats->tx_failure_drops += tx_failure;
-            sw_stats->tx_mtu_exceeded_drops += mtu_drops;
-            sw_stats->tx_qos_drops += qos_drops;
-            sw_stats->tx_invalid_hwol_drops += hwol_drops;
-            rte_spinlock_unlock(&dev->stats_lock);
-        }
+        rte_spinlock_lock(&dev->stats_lock);
+        dev->stats.tx_dropped += dropped;
+        sw_stats->tx_failure_drops += stats.tx_failure_drops;
+        sw_stats->tx_mtu_exceeded_drops += stats.tx_mtu_exceeded_drops;
+        sw_stats->tx_qos_drops += stats.tx_qos_drops;
+        sw_stats->tx_invalid_hwol_drops += stats.tx_invalid_hwol_drops;
+        rte_spinlock_unlock(&dev->stats_lock);
     }
 
     if (OVS_UNLIKELY(concurrent_txq)) {
         rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
     }
-}
-
-static int
-netdev_dpdk_eth_send(struct netdev *netdev, int qid,
-                     struct dp_packet_batch *batch, bool concurrent_txq)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
-    netdev_dpdk_send__(dev, qid, batch, concurrent_txq);
     return 0;
 }