
[ovs-dev,RFC] netdev-dpdk: Add SW queue before Vhost-user Rx virtqueue

Message ID 1470667022-31373-1-git-send-email-maxime.coquelin@redhat.com
State RFC
Delegated to: Darrell Ball

Commit Message

Maxime Coquelin Aug. 8, 2016, 2:37 p.m. UTC
This patch tries to address a packet drop issue on the Rx path,
which may happen when the VM's vCPUs are not 1:1 mapped to physical CPUs,
or when the VM's memory is overcommitted.

In these configurations, the VM may be paused for several milliseconds,
resulting in the Rx virtqueue not being emptied often enough.

This problem can be seen even at small packet rates. For example, at 0.3 Mpps
with a virtqueue of 256 entries, if the VM is paused for 4 ms every second,
around 550 packets get dropped per second.
With the virtqueue size increased to 1024, no more drops are seen.
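
As a rough back-of-envelope estimate (assuming the queue is otherwise keeping
up with the offered load): 0.3 Mpps x 4 ms = 1200 packets arrive while the VM
is paused, which a 256-entry virtqueue can only partly absorb; the exact drop
count also depends on how full the ring already was and on the existing
enqueue retries. A 1024-entry ring leaves enough headroom to ride out such a
pause.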

The first option to solve this issue is to enlarge the virtqueue ring
size, which is being worked on in QEMU [0]. However, the virtqueue size
cannot exceed the 1024 limit imposed by S/G lists in the kernel.

This patch introduces SW rings in front of the vhost Rx virtqueues.
A new "sw_ring_size" option allows setting the ring size at runtime
(0 disables it; being an RTE ring, the size has to be a power of 2).

For example, to add a SW ring of 1024 elements:
$> ovs-vsctl set Interface vhost-user1 options:sw_ring_size=1024
It can be disabled again at runtime with:
$> ovs-vsctl set Interface vhost-user1 options:sw_ring_size=0
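As a usage note, the configured value can be read back with the standard
ovs-vsctl "get" command (reusing the example port name from above):
$> ovs-vsctl get Interface vhost-user1 options:sw_ring_size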

If no SW ring size is provided, the behaviour is the same as before this patch.
If a SW ring size is set, there are two paths when sending packets (a
condensed sketch follows the list below):
 1 - The send function has ownership of the packets (may_steal)
  In this case, we can enqueue the packets and possibly send them later.
  First, we enqueue the packets in the SW ring; if the SW ring is full,
the packets get dropped. Then we loop to drain the SW ring into the virtqueue.
  Packets are freed as soon as they are inserted into the virtqueue.
 2 - The send function doesn't have ownership of the packets (!may_steal)
  In this case, we cannot enqueue the packets, as we lose their
ownership as soon as we exit the send function.
  First, we try to drain the SW ring into the virtqueue. If the SW ring has
been successfully emptied, we try to send the packets directly into the
virtqueue.
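
To summarize, here is a condensed sketch of the Tx flow (a paraphrase of the
patch below for readability only; locking, stats accounting and the per-packet
drop/free handling are trimmed, so it is not a drop-in replacement; it reuses
helpers already present in netdev-dpdk.c and the DPDK rte_ring/rte_vhost API):

/* Sketch of __netdev_dpdk_vhost_send() when a SW ring is configured. */
static void
vhost_send_sketch(struct netdev_dpdk *dev, int vhost_qid,
                  struct rte_ring *sw_ring, struct rte_mbuf **pkts,
                  unsigned int cnt, bool may_steal)
{
    /* Path 1: we own the packets, so buffer as many as fit in the SW ring;
     * the ones that do not fit are dropped (freed) further down. */
    if (may_steal && sw_ring) {
        unsigned int n = MIN(cnt, rte_ring_free_count(sw_ring));
        rte_ring_sp_enqueue_bulk(sw_ring, (void **) pkts, n);
    }

    /* Drain the SW ring into the virtqueue in bursts, bounded by
     * VHOST_MAX_BURST and by the space advertised by the guest. */
    while (sw_ring && rte_ring_count(sw_ring)) {
        struct rte_mbuf *burst[VHOST_MAX_BURST];
        unsigned int n = MIN(rte_ring_count(sw_ring),
                             rte_vhost_avail_entries(dev->vid, vhost_qid));

        n = MIN(n, VHOST_MAX_BURST);
        if (!n || rte_ring_sc_dequeue_bulk(sw_ring, (void **) burst, n)) {
            break;   /* Virtqueue full or dequeue failed: stop draining. */
        }
        rte_vhost_enqueue_burst(dev->vid, vhost_qid, burst, n);
        /* The mbufs are freed once copied into the virtqueue. */
    }

    /* Path 2: no SW ring, or we do not own the packets and the SW ring is
     * now empty: try to send directly into the virtqueue. */
    if (!sw_ring || (!may_steal && !rte_ring_count(sw_ring))) {
        rte_vhost_enqueue_burst(dev->vid, vhost_qid, pkts, cnt);
    }
}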

Note that this patch has not been heavily tested; it is more of a PoC to
gather your feedback on the feature and to see if better ideas arise.

[0]: https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg0073dd0.html

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/netdev-dpdk.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 148 insertions(+), 23 deletions(-)

Comments

Maxime Coquelin Sept. 6, 2016, 5:05 p.m. UTC | #1
Hi,

On 08/08/2016 04:37 PM, Maxime Coquelin wrote:
> This patch tries to address a packet drop issue on the Rx path,
> which may happen when the VM's vCPUs are not 1:1 mapped to physical CPUs,
> or when the VM's memory is overcommitted.
>
> In these configurations, the VM may be paused for several milliseconds,
> resulting in the Rx virtqueue not being emptied often enough.
>
> This problem can be seen even at small packet rates. For example, at 0.3 Mpps
> with a virtqueue of 256 entries, if the VM is paused for 4 ms every second,
> around 550 packets get dropped per second.
> With the virtqueue size increased to 1024, no more drops are seen.
>
> The first option to solve this issue is to enlarge the virtqueue ring
> size, which is being worked on in QEMU [0]. However, the virtqueue size
> cannot exceed the 1024 limit imposed by S/G lists in the kernel.
>
> This patch introduces SW rings in front of the vhost Rx virtqueues.
> A new "sw_ring_size" option allows setting the ring size at runtime
> (0 disables it; being an RTE ring, the size has to be a power of 2).
>
> For example, to add a SW ring of 1024 elements:
> $> ovs-vsctl set Interface vhost-user1 options:sw_ring_size=1024
> It can be disabled again at runtime with:
> $> ovs-vsctl set Interface vhost-user1 options:sw_ring_size=0
>
> If no SW ring size is provided, the behaviour is the same as before this patch.
> If a SW ring size is set, there are two paths when sending packets:
>  1 - The send function has ownership of the packets (may_steal)
>   In this case, we can enqueue the packets and possibly send them later.
>   First, we enqueue the packets in the SW ring; if the SW ring is full,
> the packets get dropped. Then we loop to drain the SW ring into the virtqueue.
>   Packets are freed as soon as they are inserted into the virtqueue.
>  2 - The send function doesn't have ownership of the packets (!may_steal)
>   In this case, we cannot enqueue the packets, as we lose their
> ownership as soon as we exit the send function.
>   First, we try to drain the SW ring into the virtqueue. If the SW ring has
> been successfully emptied, we try to send the packets directly into the
> virtqueue.
>
> Note that this patch has not been heavily tested; it is more of a PoC to
> gather your feedback on the feature and to see if better ideas arise.
>
> [0]: https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg0073dd0.html
>
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  lib/netdev-dpdk.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 148 insertions(+), 23 deletions(-)
>

This is a gentle reminder.
The patch should be much simpler now thanks to the reworks done by Ilya
Maximets, but I would appreciate your feedback before rebasing it.

Thanks,
Maxime

Patch

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ec0d12d..42c7501 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -143,6 +143,7 @@  static char *cuse_dev_name = NULL;    /* Character device cuse_dev_name. */
 static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
 
 #define VHOST_ENQ_RETRY_NUM 8
+#define VHOST_MAX_BURST     32
 
 static const struct rte_eth_conf port_conf = {
     .rxmode = {
@@ -301,6 +302,8 @@  struct dpdk_tx_queue {
                                     * pmd threads (see 'concurrent_txq'). */
     int map;                       /* Mapping of configured vhost-user queues
                                     * to enabled by guest. */
+    struct rte_ring *sw_ring;      /* vhost-user only, intermediate FIFO to
+                                    * buffer packets while VM is paused. */
 };
 
 /* dpdk has no way to remove dpdk ring ethernet devices
@@ -365,6 +368,9 @@  struct netdev_dpdk {
     /* Identifier used to distinguish vhost devices from each other */
     char vhost_id[PATH_MAX];
 
+    /* Size of vhost Txq SW rings - 0 for no rings */
+    int vhost_sw_rings_sz;
+
     /* In dpdk_list. */
     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
 
@@ -378,6 +384,8 @@  struct netdev_dpdk {
     int requested_n_txq;
     int requested_n_rxq;
 
+    int requested_vhost_sw_rings_sz;
+
     /* Socket ID detected when vHost device is brought up */
     int requested_socket_id;
 
@@ -402,6 +410,8 @@  static int netdev_dpdk_construct(struct netdev *);
 struct ingress_policer *
 netdev_dpdk_get_ingress_policer(const struct netdev_dpdk *dev);
 
+static void netdev_dpdk_free_sw_rings(struct netdev_dpdk *dev);
+
 static bool
 is_dpdk_class(const struct netdev_class *class)
 {
@@ -977,6 +987,7 @@  netdev_dpdk_vhost_destruct(struct netdev *netdev)
     ovs_mutex_unlock(&dev->mutex);
 
     ovs_mutex_lock(&dpdk_mutex);
+    netdev_dpdk_free_sw_rings(dev);
     rte_free(dev->tx_q);
     ovs_list_remove(&dev->list_node);
     dpdk_mp_put(dev->dpdk_mp);
@@ -1034,6 +1045,14 @@  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args)
         dev->fc_conf.mode = fc_mode_set[tx_fc_en][rx_fc_en];
 
         dpdk_eth_flow_ctrl_setup(dev);
+    } else if (dev->type == DPDK_DEV_VHOST) {
+        int new_ring_sz;
+        new_ring_sz = smap_get_int(args, "sw_ring_size",
+                                   dev->requested_vhost_sw_rings_sz);
+        if (new_ring_sz != dev->requested_vhost_sw_rings_sz) {
+            dev->requested_vhost_sw_rings_sz = new_ring_sz;
+            netdev_request_reconfigure(netdev);
+        }
     }
     ovs_mutex_unlock(&dev->mutex);
 
@@ -1368,11 +1387,13 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
+    struct rte_ring *sw_ring;
     unsigned int total_pkts = cnt;
-    unsigned int qos_pkts = cnt;
-    int retries = 0;
+    int vhost_qid, i, ret, retries = 0;
+    unsigned short tx_pkts, ring_cnt = 0, sent = 0, free_idx = 0;
 
     qid = dev->tx_q[qid % netdev->n_txq].map;
+    vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
 
     if (OVS_UNLIKELY(!is_vhost_running(dev->vid) || qid < 0
                      || !(dev->flags & NETDEV_UP))) {
@@ -1382,43 +1403,98 @@  __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
         goto out;
     }
 
+    sw_ring = dev->tx_q[qid].sw_ring;
+
     rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
 
     /* Check has QoS has been configured for the netdev */
     cnt = netdev_dpdk_qos_run__(dev, cur_pkts, cnt);
-    qos_pkts -= cnt;
+
+    /*
+     * We have ownership of the packets and an intermediate ring,
+     * so we can enqueue them.
+     */
+    if (may_steal && sw_ring) {
+        ring_cnt = rte_ring_free_count(sw_ring);
+        ring_cnt = MIN(cnt, ring_cnt);
+
+        if (OVS_LIKELY(ring_cnt)) {
+            ret = rte_ring_sp_enqueue_bulk(sw_ring,
+                    (void **)pkts, ring_cnt);
+            if (OVS_LIKELY(!ret))
+                sent = free_idx = ring_cnt;
+        }
+    }
+
+    /* Drain the intermediate SW ring */
+    if (sw_ring)
+        ring_cnt = rte_ring_count(sw_ring);
 
     do {
-        int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
-        unsigned int tx_pkts;
+        unsigned short burst_cnt;
+        struct rte_mbuf *ring_pkts[VHOST_MAX_BURST];
 
-        tx_pkts = rte_vhost_enqueue_burst(dev->vid, vhost_qid,
-                                          cur_pkts, cnt);
-        if (OVS_LIKELY(tx_pkts)) {
-            /* Packets have been sent.*/
-            cnt -= tx_pkts;
-            /* Prepare for possible retry.*/
-            cur_pkts = &cur_pkts[tx_pkts];
-        } else {
-            /* No packets sent - do not retry.*/
+        burst_cnt = MIN(ring_cnt, rte_vhost_avail_entries(dev->vid, vhost_qid));
+        burst_cnt = MIN(burst_cnt, VHOST_MAX_BURST);
+
+        if (OVS_UNLIKELY(!burst_cnt)) {
+            /*
+             * Nothing to dequeue, or no space left in virtqueue:
+             * Do not retry
+             */
             break;
         }
-    } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
+
+        ret = rte_ring_sc_dequeue_bulk(sw_ring, (void **)ring_pkts, burst_cnt);
+        if (OVS_UNLIKELY(ret))
+            break;
+
+        tx_pkts = rte_vhost_enqueue_burst(dev->vid, vhost_qid,
+                ring_pkts, burst_cnt);
+
+        /*
+         * Before calling rte_vhost_enqueue_burst(), we ensured not to feed
+         * it with more packets than it can handle.
+         * In the very unlikely case where not all the packets have been
+         * enqueued, drop the remaining ones.
+         */
+        for (i = 0; i < burst_cnt; i++)
+            dp_packet_delete((struct dp_packet *)ring_pkts[i]);
+
+        ring_cnt -= burst_cnt;
+
+    } while (ring_cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
+
+    /*
+     * Either there is no SW ring, or the intermediate SW ring is now empty
+     * and we don't own the packets: in that case, try to send the packets
+     * directly into the virtqueue.
+     */
+    if (!sw_ring || (!may_steal && !ring_cnt)) {
+        do {
+            tx_pkts = rte_vhost_enqueue_burst(dev->vid, vhost_qid,
+                                              cur_pkts, cnt);
+            if (OVS_LIKELY(tx_pkts)) {
+                cnt -= tx_pkts;
+                sent += tx_pkts;
+                cur_pkts = &cur_pkts[tx_pkts];
+            } else {
+                /* No packet sent - do not retry */
+                break;
+            }
+        } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
+    }
 
     rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
 
     rte_spinlock_lock(&dev->stats_lock);
-    cnt += qos_pkts;
-    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
+    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, total_pkts - sent);
     rte_spinlock_unlock(&dev->stats_lock);
 
 out:
     if (may_steal) {
-        int i;
-
-        for (i = 0; i < total_pkts; i++) {
+        for (i = free_idx; i < total_pkts; i++)
             dp_packet_delete(pkts[i]);
-        }
     }
 }
 
@@ -2273,6 +2349,46 @@  netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
     rte_free(enabled_queues);
 }
 
+static void
+netdev_dpdk_free_sw_rings(struct netdev_dpdk *dev)
+{
+    int i;
+
+    for (i = 0; i < OVS_VHOST_MAX_QUEUE_NUM; i++) {
+        rte_ring_free(dev->tx_q[i].sw_ring);
+        dev->tx_q[i].sw_ring = NULL;
+    }
+}
+
+
+static void
+netdev_dpdk_alloc_sw_rings(struct netdev_dpdk *dev, unsigned int n_txqs)
+{
+    unsigned i;
+
+    if (!dev->vhost_sw_rings_sz)
+        return;
+
+    for (i = 0; i < n_txqs; i++) {
+        char name[PATH_MAX];
+
+        snprintf(name, PATH_MAX, "%s-%u", dev->up.name, i);
+        dev->tx_q[i].sw_ring = rte_ring_lookup(name);
+        if (dev->tx_q[i].sw_ring)
+            continue;
+
+        dev->tx_q[i].sw_ring = rte_ring_create(name, dev->vhost_sw_rings_sz,
+                dev->socket_id, RING_F_SP_ENQ | RING_F_SC_DEQ);
+        if (!dev->tx_q[i].sw_ring)
+            VLOG_ERR("%s: Failed to create SW ring for Tx queue %u\n"
+                     "\tRing name: %s, size %u\n",
+                     dev->up.name, i, name, dev->vhost_sw_rings_sz);
+        else
+            VLOG_INFO("%s: Created SW ring of %d entries\n",
+                      name, dev->vhost_sw_rings_sz);
+    }
+}
+
 /*
  * A new virtio-net device is added to a vhost port.
  */
@@ -2877,7 +2993,7 @@  static int
 netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    int err = 0;
+    int i, err = 0;
 
     ovs_mutex_lock(&dpdk_mutex);
     ovs_mutex_lock(&dev->mutex);
@@ -2900,8 +3016,17 @@  netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
         if (!dev->dpdk_mp) {
             err = ENOMEM;
         }
+
+        for (i = 0; i < netdev->n_txq; i++)
+            rte_ring_free(dev->tx_q[i].sw_ring);
     }
 
+    if (dev->vhost_sw_rings_sz != dev->requested_vhost_sw_rings_sz) {
+        netdev_dpdk_free_sw_rings(dev);
+        dev->vhost_sw_rings_sz = dev->requested_vhost_sw_rings_sz;
+    }
+    netdev_dpdk_alloc_sw_rings(dev, netdev->n_txq);
+
     ovs_mutex_unlock(&dev->mutex);
     ovs_mutex_unlock(&dpdk_mutex);
 
@@ -3445,7 +3570,7 @@  static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
         dpdk_vhost_user_class_init,
         netdev_dpdk_vhost_user_construct,
         netdev_dpdk_vhost_destruct,
-        NULL,
+        netdev_dpdk_set_config,
         NULL,
         netdev_dpdk_vhost_send,
         netdev_dpdk_vhost_get_carrier,