
[ovs-dev,v5,1/2] netdev-dpdk: Add per virtqueue statistics.

Message ID 20221219150306.20839-1-david.marchand@redhat.com
State Superseded
Series [ovs-dev,v5,1/2] netdev-dpdk: Add per virtqueue statistics.

Checks

Context                                Check    Description
ovsrobot/apply-robot                   success  apply and check: success
ovsrobot/github-robot-_Build_and_Test  success  github build: passed
ovsrobot/intel-ovs-compilation         success  test: success

Commit Message

David Marchand Dec. 19, 2022, 3:03 p.m. UTC
The DPDK vhost-user library maintains more granular per queue stats
which can replace what OVS was providing for vhost-user ports.

The benefits for OVS:
- OVS can skip parsing packet sizes on the rx side,
- vhost-user is aware of which packets are transmitted to the guest,
  so per *transmitted* packet size stats can be reported,
- more internal stats from vhost-user may be exposed, without OVS
  needing to understand them,

Note: the vhost-user library does not provide global stats for a port.
The proposed implementation is to have the global stats (exposed via
netdev_get_stats()) computed by querying and aggregating all per queue
stats.
Since per queue stats are exposed via a separate netdev op
(netdev_get_custom_stats()), this may lead to races and small
discrepancies.
This issue might already affect other netdev classes.

Example:
$ ovs-vsctl get interface vhost4 statistics |
  sed -e 's#[{}]##g' -e 's#, #\n#g' |
  grep -v =0$
rx_1_to_64_packets=12
rx_256_to_511_packets=15
rx_65_to_127_packets=21
rx_broadcast_packets=15
rx_bytes=7497
rx_multicast_packets=33
rx_packets=48
rx_q0_good_bytes=242
rx_q0_good_packets=3
rx_q0_guest_notifications=3
rx_q0_multicast_packets=3
rx_q0_size_65_127_packets=2
rx_q0_undersize_packets=1
rx_q1_broadcast_packets=15
rx_q1_good_bytes=7255
rx_q1_good_packets=45
rx_q1_guest_notifications=45
rx_q1_multicast_packets=30
rx_q1_size_256_511_packets=15
rx_q1_size_65_127_packets=19
rx_q1_undersize_packets=11
tx_1_to_64_packets=36
tx_256_to_511_packets=45
tx_65_to_127_packets=63
tx_broadcast_packets=45
tx_bytes=22491
tx_multicast_packets=99
tx_packets=144
tx_q0_broadcast_packets=30
tx_q0_good_bytes=14994
tx_q0_good_packets=96
tx_q0_guest_notifications=96
tx_q0_multicast_packets=66
tx_q0_size_256_511_packets=30
tx_q0_size_65_127_packets=42
tx_q0_undersize_packets=24
tx_q1_broadcast_packets=15
tx_q1_good_bytes=7497
tx_q1_good_packets=48
tx_q1_guest_notifications=48
tx_q1_multicast_packets=33
tx_q1_size_256_511_packets=15
tx_q1_size_65_127_packets=21
tx_q1_undersize_packets=12
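
As a quick sanity check (illustrative only, reusing the port and counter
names from the dump above), the aggregate counters can be compared with
the per-queue sums; small differences are possible because of the race
mentioned in the note:

$ rx=$(ovs-vsctl get interface vhost4 statistics:rx_packets)
$ q0=$(ovs-vsctl get interface vhost4 statistics:rx_q0_good_packets)
$ q1=$(ovs-vsctl get interface vhost4 statistics:rx_q1_good_packets)
$ echo "rx_packets=$rx rx_q0+rx_q1=$((q0 + q1))"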

Signed-off-by: David Marchand <david.marchand@redhat.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
Changes since v3:
- rebased to master now that v22.11 landed,
- fixed error code in stats helper when vhost port is not "running",
- shortened rx/tx stats macro names,

Changes since RFC v2:
- dropped the experimental api check (now that the feature is marked
  stable in DPDK),
- moved netdev_dpdk_get_carrier() forward declaration next to the
  function needing it,
- used per q stats for netdev_get_stats() and removed OVS per packet
  size accounting logic,
- fixed small packets counter (see rx_undersized_errors hack),
- added more Tx stats,
- added unit tests,

---
 lib/netdev-dpdk.c    | 398 ++++++++++++++++++++++++++++++++-----------
 tests/system-dpdk.at |  33 +++-
 2 files changed, 332 insertions(+), 99 deletions(-)

Comments

Ilya Maximets Dec. 21, 2022, 11:18 p.m. UTC | #1
On 12/19/22 16:03, David Marchand wrote:
> The DPDK vhost-user library maintains more granular per queue stats
> which can replace what OVS was providing for vhost-user ports.
> 
> The benefits for OVS:
> - OVS can skip parsing packet sizes on the rx side,
> - vhost-user is aware of which packets are transmitted to the guest,
>   so per *transmitted* packet size stats can be reported,
> - more internal stats from vhost-user may be exposed, without OVS
>   needing to understand them,

Hi, David and Maxime.  Thanks for the patch!

The change looks good to me in general.  I would like to avoid some
of the code duplication it introduces, but I'm not sure how to actually
do that, so it's fine as is.

However, while testing the change I see a noticeable performance
degradation in a simple V-to-V scenario with testpmd and virtio-user
ports.  Performance dips by 1-3%.  It looks like the code added
to DPDK is for some reason a bit heavier than the code removed from OVS,
so they do not even out and the packet rate drops compared with the
current master.

If I comment out enabling of the RTE_VHOST_USER_NET_STATS_ENABLE
flag, I get the performance back.  It's even a bit higher, but
not high enough to compensate for the stats accounting in the vhost
library when stats are enabled.

Could you please check?  Maybe I'm missing something.

Best regards, Ilya Maximets.
David Marchand Jan. 5, 2023, 1:31 p.m. UTC | #2
Hi Ilya,

On Thu, Dec 22, 2022 at 12:18 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 12/19/22 16:03, David Marchand wrote:
> > The DPDK vhost-user library maintains more granular per queue stats
> > which can replace what OVS was providing for vhost-user ports.
> >
> > The benefits for OVS:
> > - OVS can skip parsing packet sizes on the rx side,
> > - vhost-user is aware of which packets are transmitted to the guest,
> >   so per *transmitted* packet size stats can be reported,
> > - more internal stats from vhost-user may be exposed, without OVS
> >   needing to understand them,
>
> Hi, David and Maxime.  Thanks for the patch!
>
> The change looks good to me in general.  I would like to avoid some
> of the code duplication it introduces, but I'm not sure how to actually
> do that, so it's fine as is.
>
> However, while testing the change I see a noticeable performance
> degradation in a simple V-to-V scenario with testpmd and virtio-user
> ports.  Performance dips by 1-3%.  It looks like the code added
> to DPDK is for some reason a bit heavier than the code removed from OVS,
> so they do not even out and the packet rate drops compared with the
> current master.
>
> If I comment out enabling of the RTE_VHOST_USER_NET_STATS_ENABLE
> flag, I get the performance back.  It's even a bit higher, but
> not high enough to compensate for the stats accounting in the vhost
> library when stats are enabled.

Sorry it took me a while, but I can't see such a difference in the numbers.


I did some experiments with a simple mono-directional setup like:
virtio-user testpmd/txonly --> ovs --> virtio-user testpmd/rxonly
I had to restart testpmd between each restart of OVS: it looks like
virtio-user (as server) reconnection has some bug.

I took care to dedicate one HT per datapath thread (i.e. all physical
cores are isolated, and ovs pmd thread runs on a HT whose sibling is
idle. Same for testpmd rx / tx threads).
I rely on testpmd tx stats, grabbing stats for 10 runs of a few seconds.
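
For reference, the two testpmd instances in such a setup look roughly like
this (a sketch only: socket paths, core lists and file prefixes are
placeholders, not the exact ones used for these runs):

$ dpdk-testpmd -l 0,1 --no-pci --single-file-segments --file-prefix=pg0 \
      --vdev="net_virtio_user,path=/tmp/vhost0,server=1" \
      -- --forward-mode=txonly --auto-start

$ dpdk-testpmd -l 2,3 --no-pci --single-file-segments --file-prefix=pg1 \
      --vdev="net_virtio_user,path=/tmp/vhost1,server=1" \
      -- --forward-mode=rxonly --auto-start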

When OVS drops everything (no fwd to other port, which means we are
only counting received packets from OVS pov):
- master: 12.189Mpps
- series: 12.226Mpps (+0.3%)

I also tried with enabling/disabling stats support in vhost library
(+stats means adding RTE_VHOST_USER_NET_STATS_ENABLE, resp. -stats
means removing it):
- master+stats: 11.962Mpps (-1.9%)
- series-stats: 12.453Mpps (+2.1%)


When OVS fwds (which should be what you tested, correct?):
- master: 7.830Mpps
- series: 7.795Mpps (-0.5%)

- master+stats: 7.641Mpps (-2.5%)
- series-stats: 7.967Mpps (+1.7%)


I noticed though that if ovs pmd thread runs on a HT whose sibling is
busy polling a physical port, performance numbers are far more
unstable, and I can see differences up to +/- 5%, regardless of code
changes.
Ilya Maximets Jan. 5, 2023, 1:59 p.m. UTC | #3
On 1/5/23 14:31, David Marchand wrote:
> Hi Ilya,
> 
> On Thu, Dec 22, 2022 at 12:18 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 12/19/22 16:03, David Marchand wrote:
>>> The DPDK vhost-user library maintains more granular per queue stats
>>> which can replace what OVS was providing for vhost-user ports.
>>>
>>> The benefits for OVS:
>>> - OVS can skip parsing packet sizes on the rx side,
>>> - vhost-user is aware of which packets are transmitted to the guest,
>>>   so per *transmitted* packet size stats can be reported,
>>> - more internal stats from vhost-user may be exposed, without OVS
>>>   needing to understand them,
>>
>> Hi, David and Maxime.  Thanks for the patch!
>>
>> The change looks good to me in general.  I would like to avoid some
>> of the code duplication it introduces, but I'm not sure how to actually
>> do that, so it's fine as is.
>>
>> However, while testing the change I see a noticeable performance
>> degradation in a simple V-to-V scenario with testpmd and virtio-user
>> ports.  Performance dips by 1-3%.  It looks like the code added
>> to DPDK is for some reason a bit heavier than the code removed from OVS,
>> so they do not even out and the packet rate drops compared with the
>> current master.
>>
>> If I comment out enabling of the RTE_VHOST_USER_NET_STATS_ENABLE
>> flag, I get the performance back.  It's even a bit higher, but
>> not high enough to compensate for the stats accounting in the vhost
>> library when stats are enabled.
> 
> Sorry it took me a while, but I can't see such a difference in the numbers.
> 
> 
> I did some experiments with a simple mono-directional setup like:
> virtio-user testpmd/txonly --> ovs --> virtio-user testpmd/rxonly
> I had to restart testpmd between each restart of OVS: it looks like
> virtio-user (as server) reconnection has some bug.
> 
> I took care to dedicate one HT per datapath thread (i.e. all physical
> cores are isolated, and ovs pmd thread runs on a HT whose sibling is
> idle. Same for testpmd rx / tx threads).
> I rely on testpmd tx stats, grabbing stats for 10 runs of a few seconds.
> 
> When OVS drops everything (no fwd to other port, which means we are
> only counting received packets from OVS pov):
> - master: 12.189Mpps
> - series: 12.226Mpps (+0.3%)
> 
> I also tried with enabling/disabling stats support in vhost library
> (+stats means adding RTE_VHOST_USER_NET_STATS_ENABLE, resp. -stats
> means removing it):
> - master+stats: 11.962Mpps (-1.9%)
> - series-stats: 12.453Mpps (+2.1%)
> 
> 
> When OVS fwds (which should be what you tested, correct?):
> - master: 7.830Mpps
> - series: 7.795Mpps (-0.5%)
> 
> - master+stats: 7.641Mpps (-2.5%)
> - series-stats: 7.967Mpps (+1.7%)


My setup is a bit different, because it's bidirectional.  I have 2
testpmd apps: one in txonly mode, the other in mac mode.
On the OVS side I have 2 PMD threads.  Each port has a single queue pinned
to its own PMD thread.  All the used cores are on the same NUMA node, but
are not siblings.  Having 2 PMD threads doing RX and TX on the same port
is an important factor, because of the stats_lock contention.
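
For reference, the pinning described above maps to something like the
following (a sketch; port names, core numbers and the mask are placeholders):

$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC
$ ovs-vsctl set Interface vhost0 other_config:pmd-rxq-affinity="0:2"
$ ovs-vsctl set Interface vhost1 other_config:pmd-rxq-affinity="0:3"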

My numbers are:

actions:NORMAL
  ---
  master: 7.496 Mpps
  series: 7.383 Mpps (-1.5%)

in_port=1,actions:2
in_port=2,actions:1
  ---
  master: 8.296 Mpps
  series: 8.072 Mpps (-2.7%)

> 
> I noticed though that if ovs pmd thread runs on a HT whose sibling is
> busy polling a physical port, performance numbers are far more
> unstable, and I can see differences up to +/- 5%, regardless of code
> changes.
David Marchand Jan. 5, 2023, 4:54 p.m. UTC | #4
On Thu, Jan 5, 2023 at 2:59 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > I did some experiments with a simple mono-directional setup like:
> > virtio-user testpmd/txonly --> ovs --> virtio-user testpmd/rxonly
> > I had to restart testpmd between each restart of OVS: it looks like
> > virtio-user (as server) reconnection has some bug.
> >
> > I took care to dedicate one HT per datapath thread (i.e. all physical
> > cores are isolated, and ovs pmd thread runs on a HT whose sibling is
> > idle. Same for testpmd rx / tx threads).
> > I rely on testpmd tx stats, grabbing stats for 10 runs of a few seconds.
> >
> > When OVS drops everything (no fwd to other port, which means we are
> > only counting received packets from OVS pov):
> > - master: 12.189Mpps
> > - series: 12.226Mpps (+0.3%)
> >
> > I also tried with enabling/disabling stats support in vhost library
> > (+stats means adding RTE_VHOST_USER_NET_STATS_ENABLE, resp. -stats
> > means removing it):
> > - master+stats: 11.962Mpps (-1.9%)
> > - series-stats: 12.453Mpps (+2.1%)
> >
> >
> > When OVS fwds (which should be what you tested, correct?):
> > - master: 7.830Mpps
> > - series: 7.795Mpps (-0.5%)
> >
> > - master+stats: 7.641Mpps (-2.5%)
> > - series-stats: 7.967Mpps (+1.7%)
>
>
> My setup is a bit different, because it's bidirectional.  I have 2
> testpmd apps: one in txonly mode, the other in mac mode.
> On the OVS side I have 2 PMD threads.  Each port has a single queue pinned
> to its own PMD thread.  All the used cores are on the same NUMA node, but
> are not siblings.  Having 2 PMD threads doing RX and TX on the same port
> is an important factor, because of the stats_lock contention.
>
> My numbers are:
>
> actions:NORMAL
>   ---
>   master: 7.496 Mpps
>   series: 7.383 Mpps (-1.5%)
>
> in_port=1,actions:2
> in_port=2,actions:1
>   ---
>   master: 8.296 Mpps
>   series: 8.072 Mpps (-2.7%)

OK, I can reproduce a drop.
This made me realise one (ugly) issue in my patch and something I can
enhance with respect to locks.
I'll prepare a new revision and test it a bit more.

I still have in mind your lock contention enhancement patch in the
list of things to consider, but let's keep it as a last resort.
Ilya Maximets Jan. 5, 2023, 5:19 p.m. UTC | #5
On 1/5/23 17:54, David Marchand wrote:
> On Thu, Jan 5, 2023 at 2:59 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>> I did some experiments with a simple mono-directional setup like:
>>> virtio-user testpmd/txonly --> ovs --> virtio-user testpmd/rxonly
>>> I had to restart testpmd between each restart of OVS: it looks like
>>> virtio-user (as server) reconnection has some bug.
>>>
>>> I took care to dedicate one HT per datapath thread (i.e. all physical
>>> cores are isolated, and ovs pmd thread runs on a HT whose sibling is
>>> idle. Same for testpmd rx / tx threads).
>>> I rely on testpmd tx stats, grabbing stats for 10 runs of a few seconds.
>>>
>>> When OVS drops everything (no fwd to other port, which means we are
>>> only counting received packets from OVS pov):
>>> - master: 12.189Mpps
>>> - series: 12.226Mpps (+0.3%)
>>>
>>> I also tried with enabling/disabling stats support in vhost library
>>> (+stats means adding RTE_VHOST_USER_NET_STATS_ENABLE, resp. -stats
>>> means removing it):
>>> - master+stats: 11.962Mpps (-1.9%)
>>> - series-stats: 12.453Mpps (+2.1%)
>>>
>>>
>>> When OVS fwds (which should be what you tested, correct?):
>>> - master: 7.830Mpps
>>> - series: 7.795Mpps (-0.5%)
>>>
>>> - master+stats: 7.641Mpps (-2.5%)
>>> - series-stats: 7.967Mpps (+1.7%)
>>
>>
>> My setup is a bit different, because it's bidirectional.  I have 2
>> testpmd apps: one in txonly mode, the other in mac mode.
>> On the OVS side I have 2 PMD threads.  Each port has a single queue pinned
>> to its own PMD thread.  All the used cores are on the same NUMA node, but
>> are not siblings.  Having 2 PMD threads doing RX and TX on the same port
>> is an important factor, because of the stats_lock contention.
>>
>> My numbers are:
>>
>> actions:NORMAL
>>   ---
>>   master: 7.496 Mpps
>>   series: 7.383 Mpps (-1.5%)
>>
>> in_port=1,actions:2
>> in_port=2,actions:1
>>   ---
>>   master: 8.296 Mpps
>>   series: 8.072 Mpps (-2.7%)
> 
> OK, I can reproduce a drop.
> This made me realise one (ugly) issue in my patch and something I can
> enhance with respect to locks.
> I'll prepare a new revision and test it a bit more.
> 
> I still have in mind your lock contention enhancement patch in the
> list of things to consider, but let's keep it as a last resort.

Ack.  Thanks!

Best regards, Ilya Maximets.

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index fff57f7827..659f53cadc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2363,66 +2363,11 @@  is_vhost_running(struct netdev_dpdk *dev)
     return (netdev_dpdk_get_vid(dev) >= 0 && dev->vhost_reconfigured);
 }
 
-static inline void
-netdev_dpdk_vhost_update_rx_size_counters(struct netdev_stats *stats,
-                                          unsigned int packet_size)
-{
-    /* Hard-coded search for the size bucket. */
-    if (packet_size < 256) {
-        if (packet_size >= 128) {
-            stats->rx_128_to_255_packets++;
-        } else if (packet_size <= 64) {
-            stats->rx_1_to_64_packets++;
-        } else {
-            stats->rx_65_to_127_packets++;
-        }
-    } else {
-        if (packet_size >= 1523) {
-            stats->rx_1523_to_max_packets++;
-        } else if (packet_size >= 1024) {
-            stats->rx_1024_to_1522_packets++;
-        } else if (packet_size < 512) {
-            stats->rx_256_to_511_packets++;
-        } else {
-            stats->rx_512_to_1023_packets++;
-        }
-    }
-}
-
 static inline void
 netdev_dpdk_vhost_update_rx_counters(struct netdev_dpdk *dev,
-                                     struct dp_packet **packets, int count,
                                      int qos_drops)
 {
-    struct netdev_stats *stats = &dev->stats;
-    struct dp_packet *packet;
-    unsigned int packet_size;
-    int i;
-
-    stats->rx_packets += count;
-    stats->rx_dropped += qos_drops;
-    for (i = 0; i < count; i++) {
-        packet = packets[i];
-        packet_size = dp_packet_size(packet);
-
-        if (OVS_UNLIKELY(packet_size < ETH_HEADER_LEN)) {
-            /* This only protects the following multicast counting from
-             * too short packets, but it does not stop the packet from
-             * further processing. */
-            stats->rx_errors++;
-            stats->rx_length_errors++;
-            continue;
-        }
-
-        netdev_dpdk_vhost_update_rx_size_counters(stats, packet_size);
-
-        struct eth_header *eh = (struct eth_header *) dp_packet_data(packet);
-        if (OVS_UNLIKELY(eth_addr_is_multicast(eh->eth_dst))) {
-            stats->multicast++;
-        }
-
-        stats->rx_bytes += packet_size;
-    }
+    dev->stats.rx_dropped += qos_drops;
 
     if (OVS_UNLIKELY(qos_drops)) {
         dev->sw_stats->rx_qos_drops += qos_drops;
@@ -2474,8 +2419,7 @@  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
     }
 
     rte_spinlock_lock(&dev->stats_lock);
-    netdev_dpdk_vhost_update_rx_counters(dev, batch->packets,
-                                         nb_rx, qos_drops);
+    netdev_dpdk_vhost_update_rx_counters(dev, qos_drops);
     rte_spinlock_unlock(&dev->stats_lock);
 
     batch->count = nb_rx;
@@ -2589,24 +2533,14 @@  netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
 
 static inline void
 netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
-                                     struct dp_packet **packets,
-                                     int attempted,
                                      struct netdev_dpdk_sw_stats *sw_stats_add)
 {
     int dropped = sw_stats_add->tx_mtu_exceeded_drops +
                   sw_stats_add->tx_qos_drops +
                   sw_stats_add->tx_failure_drops +
                   sw_stats_add->tx_invalid_hwol_drops;
-    struct netdev_stats *stats = &dev->stats;
-    int sent = attempted - dropped;
-    int i;
 
-    stats->tx_packets += sent;
-    stats->tx_dropped += dropped;
-
-    for (i = 0; i < sent; i++) {
-        stats->tx_bytes += dp_packet_size(packets[i]);
-    }
+    dev->stats.tx_dropped += dropped;
 
     if (OVS_UNLIKELY(dropped || sw_stats_add->tx_retries)) {
         struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
@@ -2795,13 +2729,13 @@  netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     int max_retries = VHOST_ENQ_RETRY_MIN;
-    int cnt, batch_cnt, vhost_batch_cnt;
     int vid = netdev_dpdk_get_vid(dev);
     struct netdev_dpdk_sw_stats stats;
+    int cnt, vhost_batch_cnt;
     struct rte_mbuf **pkts;
     int retries;
 
-    batch_cnt = cnt = dp_packet_batch_size(batch);
+    cnt = dp_packet_batch_size(batch);
     qid = dev->tx_q[qid % netdev->n_txq].map;
     if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured || qid < 0
                      || !(dev->flags & NETDEV_UP))) {
@@ -2851,8 +2785,7 @@  netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
     stats.tx_retries = MIN(retries, max_retries);
 
     rte_spinlock_lock(&dev->stats_lock);
-    netdev_dpdk_vhost_update_tx_counters(dev, batch->packets, batch_cnt,
-                                         &stats);
+    netdev_dpdk_vhost_update_tx_counters(dev, &stats);
     rte_spinlock_unlock(&dev->stats_lock);
 
     pkts = (struct rte_mbuf **) batch->packets;
@@ -3007,41 +2940,304 @@  netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
     return 0;
 }
 
-static int
-netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
-
 static int
 netdev_dpdk_vhost_get_stats(const struct netdev *netdev,
                             struct netdev_stats *stats)
 {
+    struct rte_vhost_stat_name *vhost_stats_names = NULL;
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_vhost_stat *vhost_stats = NULL;
+    int vhost_stats_count;
+    int err;
+    int qid;
+    int vid;
 
     ovs_mutex_lock(&dev->mutex);
 
-    rte_spinlock_lock(&dev->stats_lock);
-    /* Supported Stats */
-    stats->rx_packets = dev->stats.rx_packets;
-    stats->tx_packets = dev->stats.tx_packets;
+    if (!is_vhost_running(dev)) {
+        err = EPROTO;
+        goto out;
+    }
+
+    vid = netdev_dpdk_get_vid(dev);
+
+    /* We expect all rxqs have the same number of stats, only query rxq0. */
+    qid = 0 * VIRTIO_QNUM + VIRTIO_TXQ;
+    err = rte_vhost_vring_stats_get_names(vid, qid, NULL, 0);
+    if (err < 0) {
+        err = EPROTO;
+        goto out;
+    }
+
+    vhost_stats_count = err;
+    vhost_stats_names = xcalloc(vhost_stats_count, sizeof *vhost_stats_names);
+    vhost_stats = xcalloc(vhost_stats_count, sizeof *vhost_stats);
+
+    err = rte_vhost_vring_stats_get_names(vid, qid, vhost_stats_names,
+                                          vhost_stats_count);
+    if (err != vhost_stats_count) {
+        err = EPROTO;
+        goto out;
+    }
+
+#define VHOST_RXQ_STATS                                               \
+    VHOST_RXQ_STAT(rx_packets,              "good_packets")           \
+    VHOST_RXQ_STAT(rx_bytes,                "good_bytes")             \
+    VHOST_RXQ_STAT(rx_broadcast_packets,    "broadcast_packets")      \
+    VHOST_RXQ_STAT(multicast,               "multicast_packets")      \
+    VHOST_RXQ_STAT(rx_undersized_errors,    "undersize_packets")      \
+    VHOST_RXQ_STAT(rx_1_to_64_packets,      "size_64_packets")        \
+    VHOST_RXQ_STAT(rx_65_to_127_packets,    "size_65_127_packets")    \
+    VHOST_RXQ_STAT(rx_128_to_255_packets,   "size_128_255_packets")   \
+    VHOST_RXQ_STAT(rx_256_to_511_packets,   "size_256_511_packets")   \
+    VHOST_RXQ_STAT(rx_512_to_1023_packets,  "size_512_1023_packets")  \
+    VHOST_RXQ_STAT(rx_1024_to_1522_packets, "size_1024_1518_packets") \
+    VHOST_RXQ_STAT(rx_1523_to_max_packets,  "size_1519_max_packets")
+
+#define VHOST_RXQ_STAT(MEMBER, NAME) dev->stats.MEMBER = 0;
+    VHOST_RXQ_STATS;
+#undef VHOST_RXQ_STAT
+
+    for (int q = 0; q < dev->up.n_rxq; q++) {
+        qid = q * VIRTIO_QNUM + VIRTIO_TXQ;
+
+        err = rte_vhost_vring_stats_get(vid, qid, vhost_stats,
+                                        vhost_stats_count);
+        if (err != vhost_stats_count) {
+            err = EPROTO;
+            goto out;
+        }
+
+        for (int i = 0; i < vhost_stats_count; i++) {
+#define VHOST_RXQ_STAT(MEMBER, NAME)                                 \
+            if (string_ends_with(vhost_stats_names[i].name, NAME)) { \
+                dev->stats.MEMBER += vhost_stats[i].value;           \
+                continue;                                            \
+            }
+            VHOST_RXQ_STATS;
+#undef VHOST_RXQ_STAT
+        }
+    }
+
+    /* OVS reports 64 bytes and smaller packets into "rx_1_to_64_packets".
+     * Since vhost only reports good packets and has no error counter,
+     * rx_undersized_errors is highjacked (see above) to retrieve
+     * "undersize_packets". */
+    dev->stats.rx_1_to_64_packets += dev->stats.rx_undersized_errors;
+    memset(&dev->stats.rx_undersized_errors, 0xff,
+           sizeof dev->stats.rx_undersized_errors);
+
+#define VHOST_RXQ_STAT(MEMBER, NAME) stats->MEMBER = dev->stats.MEMBER;
+    VHOST_RXQ_STATS;
+#undef VHOST_RXQ_STAT
+
+    free(vhost_stats_names);
+    vhost_stats_names = NULL;
+    free(vhost_stats);
+    vhost_stats = NULL;
+
+    /* We expect all txqs have the same number of stats, only query txq0. */
+    qid = 0 * VIRTIO_QNUM;
+    err = rte_vhost_vring_stats_get_names(vid, qid, NULL, 0);
+    if (err < 0) {
+        err = EPROTO;
+        goto out;
+    }
+
+    vhost_stats_count = err;
+    vhost_stats_names = xcalloc(vhost_stats_count, sizeof *vhost_stats_names);
+    vhost_stats = xcalloc(vhost_stats_count, sizeof *vhost_stats);
+
+    err = rte_vhost_vring_stats_get_names(vid, qid, vhost_stats_names,
+                                          vhost_stats_count);
+    if (err != vhost_stats_count) {
+        err = EPROTO;
+        goto out;
+    }
+
+#define VHOST_TXQ_STATS                                               \
+    VHOST_TXQ_STAT(tx_packets,              "good_packets")           \
+    VHOST_TXQ_STAT(tx_bytes,                "good_bytes")             \
+    VHOST_TXQ_STAT(tx_broadcast_packets,    "broadcast_packets")      \
+    VHOST_TXQ_STAT(tx_multicast_packets,    "multicast_packets")      \
+    VHOST_TXQ_STAT(rx_undersized_errors,    "undersize_packets")      \
+    VHOST_TXQ_STAT(tx_1_to_64_packets,      "size_64_packets")        \
+    VHOST_TXQ_STAT(tx_65_to_127_packets,    "size_65_127_packets")    \
+    VHOST_TXQ_STAT(tx_128_to_255_packets,   "size_128_255_packets")   \
+    VHOST_TXQ_STAT(tx_256_to_511_packets,   "size_256_511_packets")   \
+    VHOST_TXQ_STAT(tx_512_to_1023_packets,  "size_512_1023_packets")  \
+    VHOST_TXQ_STAT(tx_1024_to_1522_packets, "size_1024_1518_packets") \
+    VHOST_TXQ_STAT(tx_1523_to_max_packets,  "size_1519_max_packets")
+
+#define VHOST_TXQ_STAT(MEMBER, NAME) dev->stats.MEMBER = 0;
+    VHOST_TXQ_STATS;
+#undef VHOST_TXQ_STAT
+
+    for (int q = 0; q < dev->up.n_txq; q++) {
+        qid = q * VIRTIO_QNUM;
+
+        err = rte_vhost_vring_stats_get(vid, qid, vhost_stats,
+                                        vhost_stats_count);
+        if (err != vhost_stats_count) {
+            err = EPROTO;
+            goto out;
+        }
+
+        for (int i = 0; i < vhost_stats_count; i++) {
+#define VHOST_TXQ_STAT(MEMBER, NAME)                                 \
+            if (string_ends_with(vhost_stats_names[i].name, NAME)) { \
+                dev->stats.MEMBER += vhost_stats[i].value;           \
+                continue;                                            \
+            }
+            VHOST_TXQ_STATS;
+#undef VHOST_TXQ_STAT
+        }
+    }
+
+    /* OVS reports 64 bytes and smaller packets into "tx_1_to_64_packets".
+     * Same as for rx, rx_undersized_errors is highjacked. */
+    dev->stats.tx_1_to_64_packets += dev->stats.rx_undersized_errors;
+    memset(&dev->stats.rx_undersized_errors, 0xff,
+           sizeof dev->stats.rx_undersized_errors);
+
+#define VHOST_TXQ_STAT(MEMBER, NAME) stats->MEMBER = dev->stats.MEMBER;
+    VHOST_TXQ_STATS;
+#undef VHOST_TXQ_STAT
+
     stats->rx_dropped = dev->stats.rx_dropped;
     stats->tx_dropped = dev->stats.tx_dropped;
-    stats->multicast = dev->stats.multicast;
-    stats->rx_bytes = dev->stats.rx_bytes;
-    stats->tx_bytes = dev->stats.tx_bytes;
-    stats->rx_errors = dev->stats.rx_errors;
-    stats->rx_length_errors = dev->stats.rx_length_errors;
-
-    stats->rx_1_to_64_packets = dev->stats.rx_1_to_64_packets;
-    stats->rx_65_to_127_packets = dev->stats.rx_65_to_127_packets;
-    stats->rx_128_to_255_packets = dev->stats.rx_128_to_255_packets;
-    stats->rx_256_to_511_packets = dev->stats.rx_256_to_511_packets;
-    stats->rx_512_to_1023_packets = dev->stats.rx_512_to_1023_packets;
-    stats->rx_1024_to_1522_packets = dev->stats.rx_1024_to_1522_packets;
-    stats->rx_1523_to_max_packets = dev->stats.rx_1523_to_max_packets;
 
+    err = 0;
+out:
     rte_spinlock_unlock(&dev->stats_lock);
 
     ovs_mutex_unlock(&dev->mutex);
 
+    free(vhost_stats);
+    free(vhost_stats_names);
+
+    return err;
+}
+
+static int
+netdev_dpdk_vhost_get_custom_stats(const struct netdev *netdev,
+                                   struct netdev_custom_stats *custom_stats)
+{
+    struct rte_vhost_stat_name *vhost_stats_names = NULL;
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct rte_vhost_stat *vhost_stats = NULL;
+    int vhost_rxq_stats_count;
+    int vhost_txq_stats_count;
+    int stat_offset;
+    int err;
+    int qid;
+    int vid;
+
+    netdev_dpdk_get_sw_custom_stats(netdev, custom_stats);
+    stat_offset = custom_stats->size;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (!is_vhost_running(dev)) {
+        goto out;
+    }
+
+    vid = netdev_dpdk_get_vid(dev);
+
+    qid = 0 * VIRTIO_QNUM + VIRTIO_TXQ;
+    err = rte_vhost_vring_stats_get_names(vid, qid, NULL, 0);
+    if (err < 0) {
+        goto out;
+    }
+    vhost_rxq_stats_count = err;
+
+    qid = 0 * VIRTIO_QNUM;
+    err = rte_vhost_vring_stats_get_names(vid, qid, NULL, 0);
+    if (err < 0) {
+        goto out;
+    }
+    vhost_txq_stats_count = err;
+
+    stat_offset += dev->up.n_rxq * vhost_rxq_stats_count;
+    stat_offset += dev->up.n_txq * vhost_txq_stats_count;
+    custom_stats->counters = xrealloc(custom_stats->counters,
+                                      stat_offset *
+                                      sizeof *custom_stats->counters);
+    stat_offset = custom_stats->size;
+
+    vhost_stats_names = xcalloc(vhost_rxq_stats_count,
+                                sizeof *vhost_stats_names);
+    vhost_stats = xcalloc(vhost_rxq_stats_count, sizeof *vhost_stats);
+
+    for (int q = 0; q < dev->up.n_rxq; q++) {
+        qid = q * VIRTIO_QNUM + VIRTIO_TXQ;
+
+        err = rte_vhost_vring_stats_get_names(vid, qid, vhost_stats_names,
+                                              vhost_rxq_stats_count);
+        if (err != vhost_rxq_stats_count) {
+            goto out;
+        }
+
+        err = rte_vhost_vring_stats_get(vid, qid, vhost_stats,
+                                        vhost_rxq_stats_count);
+        if (err != vhost_rxq_stats_count) {
+            goto out;
+        }
+
+        for (int i = 0; i < vhost_rxq_stats_count; i++) {
+            ovs_strlcpy(custom_stats->counters[stat_offset + i].name,
+                        vhost_stats_names[i].name,
+                        NETDEV_CUSTOM_STATS_NAME_SIZE);
+            custom_stats->counters[stat_offset + i].value =
+                 vhost_stats[i].value;
+        }
+        stat_offset += vhost_rxq_stats_count;
+    }
+
+    free(vhost_stats_names);
+    vhost_stats_names = NULL;
+    free(vhost_stats);
+    vhost_stats = NULL;
+
+    vhost_stats_names = xcalloc(vhost_txq_stats_count,
+                                sizeof *vhost_stats_names);
+    vhost_stats = xcalloc(vhost_txq_stats_count, sizeof *vhost_stats);
+
+    for (int q = 0; q < dev->up.n_txq; q++) {
+        qid = q * VIRTIO_QNUM;
+
+        err = rte_vhost_vring_stats_get_names(vid, qid, vhost_stats_names,
+                                              vhost_txq_stats_count);
+        if (err != vhost_txq_stats_count) {
+            goto out;
+        }
+
+        err = rte_vhost_vring_stats_get(vid, qid, vhost_stats,
+                                        vhost_txq_stats_count);
+        if (err != vhost_txq_stats_count) {
+            goto out;
+        }
+
+        for (int i = 0; i < vhost_txq_stats_count; i++) {
+            ovs_strlcpy(custom_stats->counters[stat_offset + i].name,
+                        vhost_stats_names[i].name,
+                        NETDEV_CUSTOM_STATS_NAME_SIZE);
+            custom_stats->counters[stat_offset + i].value =
+                 vhost_stats[i].value;
+        }
+        stat_offset += vhost_txq_stats_count;
+    }
+
+    free(vhost_stats_names);
+    vhost_stats_names = NULL;
+    free(vhost_stats);
+    vhost_stats = NULL;
+
+out:
+    ovs_mutex_unlock(&dev->mutex);
+
+    custom_stats->size = stat_offset;
+
     return 0;
 }
 
@@ -3088,6 +3284,9 @@  netdev_dpdk_convert_xstats(struct netdev_stats *stats,
 #undef DPDK_XSTATS
 }
 
+static int
+netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
+
 static int
 netdev_dpdk_get_stats(const struct netdev *netdev, struct netdev_stats *stats)
 {
@@ -3536,6 +3735,7 @@  netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
             if (NETDEV_UP & on) {
                 rte_spinlock_lock(&dev->stats_lock);
                 memset(&dev->stats, 0, sizeof dev->stats);
+                memset(dev->sw_stats, 0, sizeof *dev->sw_stats);
                 rte_spinlock_unlock(&dev->stats_lock);
             }
         }
@@ -5036,6 +5236,11 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
         dev->tx_q[0].map = 0;
     }
 
+    rte_spinlock_lock(&dev->stats_lock);
+    memset(&dev->stats, 0, sizeof dev->stats);
+    memset(dev->sw_stats, 0, sizeof *dev->sw_stats);
+    rte_spinlock_unlock(&dev->stats_lock);
+
     if (userspace_tso_enabled()) {
         dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
         VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
@@ -5096,6 +5301,9 @@  netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
         /* Register client-mode device. */
         vhost_flags |= RTE_VHOST_USER_CLIENT;
 
+        /* Extended per vq statistics. */
+        vhost_flags |= RTE_VHOST_USER_NET_STATS_ENABLE;
+
         /* There is no support for multi-segments buffers. */
         vhost_flags |= RTE_VHOST_USER_LINEARBUF_SUPPORT;
 
@@ -5574,7 +5782,7 @@  static const struct netdev_class dpdk_vhost_class = {
     .send = netdev_dpdk_vhost_send,
     .get_carrier = netdev_dpdk_vhost_get_carrier,
     .get_stats = netdev_dpdk_vhost_get_stats,
-    .get_custom_stats = netdev_dpdk_get_sw_custom_stats,
+    .get_custom_stats = netdev_dpdk_vhost_get_custom_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_reconfigure,
     .rxq_recv = netdev_dpdk_vhost_rxq_recv,
@@ -5590,7 +5798,7 @@  static const struct netdev_class dpdk_vhost_client_class = {
     .send = netdev_dpdk_vhost_send,
     .get_carrier = netdev_dpdk_vhost_get_carrier,
     .get_stats = netdev_dpdk_vhost_get_stats,
-    .get_custom_stats = netdev_dpdk_get_sw_custom_stats,
+    .get_custom_stats = netdev_dpdk_vhost_get_custom_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_client_reconfigure,
     .rxq_recv = netdev_dpdk_vhost_rxq_recv,
diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 8dc187a61d..5ef7f8ccdc 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -200,9 +200,10 @@  ADD_VETH(tap1, ns2, br10, "172.31.110.12/24")
 dnl Execute testpmd in background
 on_exit "pkill -f -x -9 'tail -f /dev/null'"
 tail -f /dev/null | dpdk-testpmd --socket-mem="$(cat NUMA_NODE)" --no-pci\
-           --vdev="net_virtio_user,path=$OVS_RUNDIR/dpdkvhostclient0,server=1" \
-           --vdev="net_tap0,iface=tap0" --file-prefix page0 \
-           --single-file-segments -- -a >$OVS_RUNDIR/testpmd-dpdkvhostuserclient0.log 2>&1 &
+    --vdev="net_virtio_user,path=$OVS_RUNDIR/dpdkvhostclient0,queues=2,server=1" \
+    --vdev="net_tap0,iface=tap0" --file-prefix page0 \
+    --single-file-segments -- -a --nb-cores 2 --rxq 2 --txq 2 \
+    >$OVS_RUNDIR/testpmd-dpdkvhostuserclient0.log 2>&1 &
 
 OVS_WAIT_UNTIL([grep "virtio is now ready for processing" ovs-vswitchd.log])
 OVS_WAIT_UNTIL([ip link show dev tap0 | grep -qw LOWER_UP])
@@ -220,9 +221,33 @@  AT_CHECK([ip netns exec ns1 ip addr add 172.31.110.11/24 dev tap0], [],
 
 AT_CHECK([ip netns exec ns1 ip link show], [], [stdout], [stderr])
 AT_CHECK([ip netns exec ns2 ip link show], [], [stdout], [stderr])
-AT_CHECK([ip netns exec ns1 ping -c 4 -I tap0 172.31.110.12], [], [stdout],
+AT_CHECK([ip netns exec ns1 ping -i 0.1 -c 10 -I tap0 172.31.110.12], [], [stdout],
          [stderr])
 
+AT_CHECK([ip netns exec ns1 ip link set tap0 down], [], [stdout], [stderr])
+
+# Wait for stats to be queried ("stats-update-interval")
+sleep 5
+AT_CHECK([ovs-vsctl get interface dpdkvhostuserclient0 statistics], [], [stdout], [stderr])
+
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_packets` -gt 0 -a dnl
+               `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_bytes` -gt 0])
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_packets` -eq dnl
+               $((`ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_q0_good_packets` + dnl
+                  `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_q1_good_packets`))])
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_bytes` -eq dnl
+               $((`ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_q0_good_bytes` + dnl
+                  `ovs-vsctl get interface dpdkvhostuserclient0 statistics:rx_q1_good_bytes`))])
+
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_packets` -gt 0 -a dnl
+               `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_bytes` -gt 0])
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_packets` -eq dnl
+               $((`ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_q0_good_packets` + dnl
+                  `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_q1_good_packets`))])
+AT_CHECK([test `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_bytes` -eq dnl
+               $((`ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_q0_good_bytes` + dnl
+                  `ovs-vsctl get interface dpdkvhostuserclient0 statistics:tx_q1_good_bytes`))])
+
 dnl Clean up the testpmd now
 pkill -f -x -9 'tail -f /dev/null'