[ovs-dev,RFC] dpif-netdev: only poll enabled vhost queues

Message ID 1554407393-24298-1-git-send-email-david.marchand@redhat.com
State Changes Requested

Commit Message

David Marchand April 4, 2019, 7:49 p.m.
We currently poll all available queues based on the max queue count
exchanged with the vhost peer, and rely on the vhost library in DPDK to
check the vring status underneath.
This can add some overhead when there are a lot of unused queues, a
situation that happens easily when you provision more queues than your
virtual machines need.

To improve the situation, we can make the rxq scheduling algorithm
skip unused queues. All we need is to catch the per-rxq notifications
and trigger a rebalance by requesting a port reconfiguration (without
changing the port configuration itself).

A consequence of this is that before a device is connected to the vhost
port, no rxq is polled at all.

Signed-off-by: David Marchand <david.marchand@redhat.com>
---

We tried to lower the number of rebalances, but we don't have a
satisfactory solution at the moment, so this patch rebalances on each
update.

---
 lib/dpif-netdev.c     |  4 +++
 lib/netdev-dpdk.c     | 71 +++++++++++++++++++++++++++++++++++++++++----------
 lib/netdev-provider.h |  5 ++++
 lib/netdev.c          | 10 ++++++++
 lib/netdev.h          |  1 +
 5 files changed, 78 insertions(+), 13 deletions(-)

Comments

Ilya Maximets April 8, 2019, 8:26 a.m. | #1
On 04.04.2019 22:49, David Marchand wrote:
> We currently poll all available queues based on the max queue count
> exchanged with the vhost peer and rely on the vhost library in DPDK to
> check the vring status beneath.
> This can lead to some overhead when we have a lot of unused queues.
> This situation happens easily when you provision more queues than you
> need for your virtual machines.
> 
> To enhance the situation, we can inform the rxq scheduling algorithm to
> skip unused queues. All we need is to catch the per rxq notifications
> and trigger a rebalance by asking for a port reconfigure (without
> changing the port configuration itself).
> 
> A consequence of this is that before a device is connected to the vhost
> port, no rxq is polled at all.
> 
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---
> 
> We tried to lower the number of rebalances but we don't have a
> satisfying solution at the moment, so this patch rebalances on each
> update.

Hi.

Triggering the reconfiguration on each vring state change is a bad thing.
This could be abused by the guest to break the host networking by endlessly
disabling/enabling queues. Each reconfiguration leads to removing ports
from the PMD port caches and reloading them. On rescheduling, all the ports
could be moved to different PMD threads, resulting in EMC/SMC/dpcls
invalidation and subsequent upcalls/packet reordering.

The same issues were discussed previously while looking at the possibility
of vhost-pmd integration (with some test results):
https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
One more reference:
7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")

One possible way to improve your case, where the guest never uses all the
queues allocated in QEMU, is to request the reconfiguration only
when we receive a "vring enabled" event for a queue with a higher
number than ever seen on this device, i.e.:

    if (is_rx && enable && qid >= dev->requested_n_rxq) {
        dev->requested_n_rxq = qid + 1;
        netdev_request_reconfigure(&dev->up);
    }

At least this will limit the number of reconfigurations to the number of
queues. But I'm still not completely sure this change is good enough.
(I believe we discussed this solution somewhere on the list, but I can't
find it right now.)

Anyway, do you have numbers for how much time the PMD thread spends polling
disabled queues? What performance improvement are you able to achieve by
avoiding that?

Best regards, Ilya Maximets.
David Marchand April 8, 2019, 1:44 p.m. | #2
Hello Ilya,

On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets@samsung.com>
wrote:

> On 04.04.2019 22:49, David Marchand wrote:
> > We tried to lower the number of rebalances but we don't have a
> > satisfying solution at the moment, so this patch rebalances on each
> > update.
>
> Hi.
>
> Triggering the reconfiguration on each vring state change is a bad thing.
> This could be abused by the guest to break the host networking by infinite
> disabling/enabling queues. Each reconfiguration leads to removing ports
> from the PMD port caches and their reloads. On rescheduling all the ports
>

I'd say the reconfiguration itself is not wanted here.
Only rebalancing the queues would be enough.


> could be moved to a different PMD threads resulting in EMC/SMC/dpcls
> invalidation and subsequent upcalls/packet reorderings.
>

I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when
moving queues.
However, EMC/SMC/dpcls are per-PMD, so where would we have packet
reordering?



> Same issues was discussed previously while looking at possibility of
> vhost-pmd integration (with some test results):
> https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html


Thanks for the link, I will test this.



> One more reference:
> 7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same
> vhost device.")
>

Yes, I saw this patch.
Are we safe against guest drivers/applications that play with
VIRTIO_NET_F_MQ, swapping it continuously?




> Anyway, do you have some numbers of how much time PMD thread spends on
> polling
> disabled queues? What the performance improvement you're able to achieve by
> avoiding that?
>

With a simple PVP setup of mine:
1c/2t polls two physical ports.
1c/2t polls four vhost ports with 16 queues each.
  Only one queue is enabled on each virtio device attached by the guest.
  The first two virtio devices are bound to the virtio kmod.
  The last two virtio devices are bound to vfio-pci and used to forward
incoming traffic with testpmd.

The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost
queues) to 6.2Mpps (polling only the 4 enabled vhost queues).
Ilya Maximets April 9, 2019, 8:34 a.m. | #3
On 08.04.2019 16:44, David Marchand wrote:
> Hello Ilya,
> 
> On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets@samsung.com <mailto:i.maximets@samsung.com>> wrote:
> 
>     On 04.04.2019 22:49, David Marchand wrote:
>     > We tried to lower the number of rebalances but we don't have a
>     > satisfying solution at the moment, so this patch rebalances on each
>     > update.
> 
>     Hi.
> 
>     Triggering the reconfiguration on each vring state change is a bad thing.
>     This could be abused by the guest to break the host networking by infinite
>     disabling/enabling queues. Each reconfiguration leads to removing ports
>     from the PMD port caches and their reloads. On rescheduling all the ports
> 
> 
> I'd say the reconfiguration itself is not wanted here.
> Only rebalancing the queues would be enough.

As you correctly mentioned in the commit message, there will be no real
port configuration changes.

By 'reconfiguration' I mean datapath reconfiguration, and 'rebalancing'
is one of its stages.

> 
> 
>     could be moved to a different PMD threads resulting in EMC/SMC/dpcls
>     invalidation and subsequent upcalls/packet reorderings.
> 
> 
> I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when moving queues.
> However, EMC/SMC/dpcls are per pmd specific, where would we have packet reordering ?

An Rx queue could be scheduled to a different PMD thread --> new packets will
go to a different Tx queue. It's unlikely, but which packets are sent first
will depend on the device/driver. The main issue here is that this happens to
other ports, not only to the port we're trying to reconfigure.

> 
> 
> 
>     Same issues was discussed previously while looking at possibility of
>     vhost-pmd integration (with some test results):
>     https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
> 
> 
> Thanks for the link, I will test this.
> 
> 
> 
>     One more reference:
>     7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")
> 
> 
> Yes, I saw this patch.
> Are we safe against guest drivers/applications that play with VIRTIO_NET_F_MQ, swapping it continuously ?

Good point. I didn't test that, but it looks like we're not safe here.
Kernel and DPDK drivers have F_MQ enabled by default, so it would require
some changes/explicit disabling. But I agree that this could produce
issues if someone does that.

We could probably handle this using a 'max seen qp_num' approach, but I'm
not a fan of it. We need to figure out how to do this correctly.

In general, I think we should not allow cases where the guest is able
to manipulate the host configuration.

> 
> 
> 
> 
>     Anyway, do you have some numbers of how much time PMD thread spends on polling
>     disabled queues? What the performance improvement you're able to achieve by
>     avoiding that?
> 
> 
> With a simple pvp setup of mine.
> 1c/2t poll two physical ports.
> 1c/2t poll four vhost ports with 16 queues each.
>   Only one queue is enabled on each virtio device attached by the guest.
>   The first two virtio devices are bound to the virtio kmod.
>   The last two virtio devices are bound to vfio-pci and used to forward incoming traffic with testpmd.
> 
> The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues) to 6.2Mpps (polling only the 4 enabled vhost queues).

That's interesting. However, this doesn't look like a realistic scenario.
In practice you'll need many more PMD threads to handle so many queues.
If you add more threads, a zeroloss test could show even worse results
if one of the idle VMs periodically changes its number of queues. Periodic
latency spikes will cause queue overruns and subsequent packet drops on
hot Rx queues. This could be partially solved by allowing n_rxq to grow only.
However, I'd be happy to have a different solution that does not hide the
number of queues from the datapath.

> -- 
> David Marchand
Ilya Maximets April 9, 2019, 11:41 a.m. | #4
On 09.04.2019 11:34, Ilya Maximets wrote:
> On 08.04.2019 16:44, David Marchand wrote:
>> Hello Ilya,
>>
>> On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets@samsung.com <mailto:i.maximets@samsung.com>> wrote:
>>
>>     On 04.04.2019 22:49, David Marchand wrote:
>>     > We tried to lower the number of rebalances but we don't have a
>>     > satisfying solution at the moment, so this patch rebalances on each
>>     > update.
>>
>>     Hi.
>>
>>     Triggering the reconfiguration on each vring state change is a bad thing.
>>     This could be abused by the guest to break the host networking by infinite
>>     disabling/enabling queues. Each reconfiguration leads to removing ports
>>     from the PMD port caches and their reloads. On rescheduling all the ports
>>
>>
>> I'd say the reconfiguration itself is not wanted here.
>> Only rebalancing the queues would be enough.
> 
> As you correctly mentioned in commit message where will be no real port
> configuration changes.
> 
> Under 'reconfiguration' I mean datapath reconfiguration and 'rebalancing'
> is one of its stages.
> 
>>
>>
>>     could be moved to a different PMD threads resulting in EMC/SMC/dpcls
>>     invalidation and subsequent upcalls/packet reorderings.
>>
>>
>> I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when moving queues.
>> However, EMC/SMC/dpcls are per pmd specific, where would we have packet reordering ?
> 
> Rx queue could be scheduled to different PMD thread --> new packets will go to different
> Tx queue. It's unlikely, but it will depend on device/driver which packets will be sent
> first. The main issue here that it happens to other ports, not only to port we're trying
> to reconfigure.
> 
>>
>>
>>
>>     Same issues was discussed previously while looking at possibility of
>>     vhost-pmd integration (with some test results):
>>     https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
>>
>>
>> Thanks for the link, I will test this.
>>
>>
>>
>>     One more reference:
>>     7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")
>>
>>
>> Yes, I saw this patch.
>> Are we safe against guest drivers/applications that play with VIRTIO_NET_F_MQ, swapping it continuously ?
> 
> Good point. I didn't test that, but it looks like we're not safe here.
> Kernel and DPDK drivers has F_MQ enabled by default so it'll require
> some changes/explicit disabling. But I agree that this could produce
> issues if someone will do that.
> 
> We could probably handle this using 'max seen qp_num' approach. But I'm
> not a fan of this. Need to figure out how to do this correctly.
> 
> In general, I think, that we should not allow cases where guest is able
> to manipulate the host configuration.
> 
>>
>>
>>
>>
>>     Anyway, do you have some numbers of how much time PMD thread spends on polling
>>     disabled queues? What the performance improvement you're able to achieve by
>>     avoiding that?
>>
>>
>> With a simple pvp setup of mine.
>> 1c/2t poll two physical ports.
>> 1c/2t poll four vhost ports with 16 queues each.
>>   Only one queue is enabled on each virtio device attached by the guest.
>>   The first two virtio devices are bound to the virtio kmod.
>>   The last two virtio devices are bound to vfio-pci and used to forward incoming traffic with testpmd.
>>
>> The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues) to 6.2Mpps (polling only the 4 enabled vhost queues).
> 
> That's interesting. However, this doesn't look like a realistic scenario.
> In practice you'll need much more PMD threads to handle so many queues.
> If you'll add more threads, zeroloss test could show even worse results
> if one of idle VMs will periodically change the number of queues. Periodic
> latency spikes will cause queue overruns and subsequent packet drops on
> hot Rx queues. This could be partially solved by allowing n_rxq to grow only.
> However, I'd be happy to have different solution that will not hide number
> of queues from the datapath.

What do you think about something like this (not even compile tested):
---
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index bd9718824..0cf492a3b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -591,6 +591,8 @@ struct polled_queue {
     struct dp_netdev_rxq *rxq;
     odp_port_t port_no;
     bool emc_enabled;
+    bool enabled;
+    uint64_t change_seq;
 };
 
 /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
@@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
         poll_list[i].rxq = poll->rxq;
         poll_list[i].port_no = poll->rxq->port->port_no;
         poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
+        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
+        poll_list[i].change_seq =
+                     netdev_get_change_seq(poll->rxq->port->netdev);
         i++;
     }
 
@@ -5440,6 +5445,10 @@ reload:
 
         for (i = 0; i < poll_cnt; i++) {
 
+            if (!poll_list[i].enabled) {
+                continue;
+            }
+
             if (poll_list[i].emc_enabled) {
                 atomic_read_relaxed(&pmd->dp->emc_insert_min,
                                     &pmd->ctx.emc_insert_min);
@@ -5476,6 +5485,15 @@ reload:
             if (reload) {
                 break;
             }
+
+            for (i = 0; i < poll_cnt; i++) {
+                uint64_t current_seq =
+                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
+                if (poll_list[i].change_seq != current_seq) {
+                    poll_list[i].change_seq = current_seq;
+                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
+                }
+            }
         }
         pmd_perf_end_iteration(s, rx_packets, tx_packets,
                                pmd_perf_metrics_enabled(pmd));
---
+ replacing your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
with 'netdev_change_seq_changed(&dev->up)'?

This should help avoid reconfiguring/rescheduling everything and not waste much
time on polling. It will also give the ability to react faster to queue state changes.

After that, we'll be able to fix reconfiguration on F_MQ manipulations with
something simple like this:
---
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 91af377e8..1937561cc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -3492,8 +3528,8 @@ new_device(int vid)
                 newnode = dev->socket_id;
             }
 
-            if (dev->requested_n_txq != qp_num
-                || dev->requested_n_rxq != qp_num
+            if (dev->requested_n_txq < qp_num
+                || dev->requested_n_rxq < qp_num
                 || dev->requested_socket_id != newnode) {
                 dev->requested_socket_id = newnode;
                 dev->requested_n_rxq = qp_num;
---
Decreasing 'qp_num' will not cause big issues because we will not poll
disabled queues.

And there could be one separate change to drop the requested values if the
socket connection is closed (I hope the guest is not able to control that):
---
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 91af377e8..1937561cc 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -185,13 +185,16 @@ static const struct rte_eth_conf port_conf = {
  */
 static int new_device(int vid);
 static void destroy_device(int vid);
+static void destroy_connection(int vid);
 static int vring_state_changed(int vid, uint16_t queue_id, int enable);
 static const struct vhost_device_ops virtio_net_device_ops =
 {
     .new_device =  new_device,
     .destroy_device = destroy_device,
     .vring_state_changed = vring_state_changed,
-    .features_changed = NULL
+    .features_changed = NULL,
+    .new_connection = NULL,
+    .destroy_connection = destroy_connection
 };
 
 enum { DPDK_RING_SIZE = 256 };
@@ -3462,6 +3465,39 @@ netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
     free(enabled_queues);
 }
 
+/*
+ * vhost-user socket connection is closed.
+ */
+static void
+destroy_connection(int vid)
+{
+    struct netdev_dpdk *dev;
+    char ifname[IF_NAME_SZ];
+
+    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
+
+    ovs_mutex_lock(&dpdk_mutex);
+    LIST_FOR_EACH(dev, list_node, &dpdk_list) {
+        ovs_mutex_lock(&dev->mutex);
+        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
+            uint32_t qp_num = NR_QUEUE;
+
+            /* Restore the number of queue pairs to default. */
+            if (dev->requested_n_txq != qp_num
+                || dev->requested_n_rxq != qp_num) {
+                dev->requested_n_rxq = qp_num;
+                dev->requested_n_txq = qp_num;
+                netdev_request_reconfigure(&dev->up);
+            }
+            ovs_mutex_unlock(&dev->mutex);
+            break;
+        }
+        ovs_mutex_unlock(&dev->mutex);
+    }
+    ovs_mutex_unlock(&dpdk_mutex);
+}
+
+
 /*
  * A new virtio-net device is added to a vhost port.
  */
---

Best regards, Ilya Maximets.
Kevin Traynor April 9, 2019, 2:49 p.m. | #5
On 09/04/2019 12:41, Ilya Maximets wrote:
> On 09.04.2019 11:34, Ilya Maximets wrote:
>> On 08.04.2019 16:44, David Marchand wrote:
>>> Hello Ilya,
>>>
>>> On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets@samsung.com <mailto:i.maximets@samsung.com>> wrote:
>>>
>>>     On 04.04.2019 22:49, David Marchand wrote:
>>>     > We tried to lower the number of rebalances but we don't have a
>>>     > satisfying solution at the moment, so this patch rebalances on each
>>>     > update.
>>>
>>>     Hi.
>>>
>>>     Triggering the reconfiguration on each vring state change is a bad thing.
>>>     This could be abused by the guest to break the host networking by infinite
>>>     disabling/enabling queues. Each reconfiguration leads to removing ports
>>>     from the PMD port caches and their reloads. On rescheduling all the ports
>>>
>>>
>>> I'd say the reconfiguration itself is not wanted here.
>>> Only rebalancing the queues would be enough.
>>
>> As you correctly mentioned in commit message where will be no real port
>> configuration changes.
>>
>> Under 'reconfiguration' I mean datapath reconfiguration and 'rebalancing'
>> is one of its stages.
>>
>>>
>>>
>>>     could be moved to a different PMD threads resulting in EMC/SMC/dpcls
>>>     invalidation and subsequent upcalls/packet reorderings.
>>>
>>>
>>> I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when moving queues.
>>> However, EMC/SMC/dpcls are per pmd specific, where would we have packet reordering ?
>>
>> Rx queue could be scheduled to different PMD thread --> new packets will go to different
>> Tx queue. It's unlikely, but it will depend on device/driver which packets will be sent
>> first. The main issue here that it happens to other ports, not only to port we're trying
>> to reconfigure.
>>
>>>
>>>
>>>
>>>     Same issues was discussed previously while looking at possibility of
>>>     vhost-pmd integration (with some test results):
>>>     https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
>>>
>>>
>>> Thanks for the link, I will test this.
>>>
>>>
>>>
>>>     One more reference:
>>>     7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")
>>>
>>>
>>> Yes, I saw this patch.
>>> Are we safe against guest drivers/applications that play with VIRTIO_NET_F_MQ, swapping it continuously ?
>>
>> Good point. I didn't test that, but it looks like we're not safe here.
>> Kernel and DPDK drivers has F_MQ enabled by default so it'll require
>> some changes/explicit disabling. But I agree that this could produce
>> issues if someone will do that.
>>
>> We could probably handle this using 'max seen qp_num' approach. But I'm
>> not a fan of this. Need to figure out how to do this correctly.
>>
>> In general, I think, that we should not allow cases where guest is able
>> to manipulate the host configuration.
>>
>>>
>>>
>>>
>>>
>>>     Anyway, do you have some numbers of how much time PMD thread spends on polling
>>>     disabled queues? What the performance improvement you're able to achieve by
>>>     avoiding that?
>>>
>>>
>>> With a simple pvp setup of mine.
>>> 1c/2t poll two physical ports.
>>> 1c/2t poll four vhost ports with 16 queues each.
>>>   Only one queue is enabled on each virtio device attached by the guest.
>>>   The first two virtio devices are bound to the virtio kmod.
>>>   The last two virtio devices are bound to vfio-pci and used to forward incoming traffic with testpmd.
>>>
>>> The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues) to 6.2Mpps (polling only the 4 enabled vhost queues).
>>
>> That's interesting. However, this doesn't look like a realistic scenario.
>> In practice you'll need much more PMD threads to handle so many queues.
>> If you'll add more threads, zeroloss test could show even worse results
>> if one of idle VMs will periodically change the number of queues. Periodic
>> latency spikes will cause queue overruns and subsequent packet drops on
>> hot Rx queues. This could be partially solved by allowing n_rxq to grow only.
>> However, I'd be happy to have different solution that will not hide number
>> of queues from the datapath.
> 
> What do you think about something like this (not even compile tested):
> ---
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index bd9718824..0cf492a3b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -591,6 +591,8 @@ struct polled_queue {
>      struct dp_netdev_rxq *rxq;
>      odp_port_t port_no;
>      bool emc_enabled;
> +    bool enabled;
> +    uint64_t change_seq;
>  };
>  
>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>          poll_list[i].rxq = poll->rxq;
>          poll_list[i].port_no = poll->rxq->port->port_no;
>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
> +        poll_list[i].change_seq =
> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>          i++;
>      }
>  
> @@ -5440,6 +5445,10 @@ reload:
>  
>          for (i = 0; i < poll_cnt; i++) {
>  
> +            if (!poll_list[i].enabled) {
> +                continue;
> +            }
> +

I had a similar thought that doing something like this might be a good
compromise, but I wonder about the possible delays in polling newly
enabled queues.

The state could just be checked directly here,

+            if (!netdev_rxq_enabled(poll_list[i].rxq)) {
+                    continue;
+            }

or, if that is too much cost, then only check disabled queues to catch the
disable->enable case early. DPDK will handle the enable->disable case, so we
could avoid the check here and do a slower update when lc > 1024.

+            if (!poll_list[i].enabled) {
+                if (OVS_LIKELY(!netdev_rxq_enabled(poll_list[i].rxq)))
+                    continue;
+                else {
+                    poll_list[i].enabled = true;
+                }
+            }

Of course, if these checks cost almost as much as letting DPDK do them in
the first place, then they are not worth it.

>              if (poll_list[i].emc_enabled) {
>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>                                      &pmd->ctx.emc_insert_min);
> @@ -5476,6 +5485,15 @@ reload:
>              if (reload) {
>                  break;
>              }
> +
> +            for (i = 0; i < poll_cnt; i++) {
> +                uint64_t current_seq =
> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
> +                if (poll_list[i].change_seq != current_seq) {
> +                    poll_list[i].change_seq = current_seq;
> +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
> +                }
> +            }
>          }
>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>                                 pmd_perf_metrics_enabled(pmd));
> ---
> + replacing of your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
> with 'netdev_change_seq_changed(&dev->up)'?
> 
> This should help to not reconfigure/reschedule everything and not waste much time on
> polling. Also this will give ability to faster react on queue state changes.
> 
> After that we'll be able to fix reconfiguration on F_MQ manipulations with something
> simple like that:
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -3492,8 +3528,8 @@ new_device(int vid)
>                  newnode = dev->socket_id;
>              }
>  
> -            if (dev->requested_n_txq != qp_num
> -                || dev->requested_n_rxq != qp_num
> +            if (dev->requested_n_txq < qp_num
> +                || dev->requested_n_rxq < qp_num
>                  || dev->requested_socket_id != newnode) {
>                  dev->requested_socket_id = newnode;
>                  dev->requested_n_rxq = qp_num;
> ---
> Decreasing of 'qp_num' will not cause big issues because we'll not poll
> disabled queues.
> 
> And there could be one separate change to drop the requested values if socket connection
> closed (I hope that guest is not able to control that):
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -185,13 +185,16 @@ static const struct rte_eth_conf port_conf = {
>   */
>  static int new_device(int vid);
>  static void destroy_device(int vid);
> +static void destroy_connection(int vid);
>  static int vring_state_changed(int vid, uint16_t queue_id, int enable);
>  static const struct vhost_device_ops virtio_net_device_ops =
>  {
>      .new_device =  new_device,
>      .destroy_device = destroy_device,
>      .vring_state_changed = vring_state_changed,
> -    .features_changed = NULL
> +    .features_changed = NULL,
> +    .new_connection = NULL,
> +    .destroy_connection = destroy_connection
>  };
>  
>  enum { DPDK_RING_SIZE = 256 };
> @@ -3462,6 +3465,39 @@ netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
>      free(enabled_queues);
>  }
>  
> +/*
> + * vhost-user socket connection is closed.
> + */
> +static void
> +destroy_connection(int vid)
> +{
> +    struct netdev_dpdk *dev;
> +    char ifname[IF_NAME_SZ];
> +
> +    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    LIST_FOR_EACH(dev, list_node, &dpdk_list) {
> +        ovs_mutex_lock(&dev->mutex);
> +        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> +            uint32_t qp_num = NR_QUEUE;
> +
> +            /* Restore the number of queue pairs to default. */
> +            if (dev->requested_n_txq != qp_num
> +                || dev->requested_n_rxq != qp_num) {
> +                dev->requested_n_rxq = qp_num;
> +                dev->requested_n_txq = qp_num;
> +                netdev_request_reconfigure(&dev->up);
> +            }
> +            ovs_mutex_unlock(&dev->mutex);
> +            break;
> +        }
> +        ovs_mutex_unlock(&dev->mutex);
> +    }
> +    ovs_mutex_unlock(&dpdk_mutex);
> +}
> +
> +
>  /*
>   * A new virtio-net device is added to a vhost port.
>   */
> ---
> 
> Best regards, Ilya Maximets.
>
David Marchand April 10, 2019, 7:32 a.m. | #6
Hello Ilya,

On Tue, Apr 9, 2019 at 1:42 PM Ilya Maximets <i.maximets@samsung.com> wrote:

>
> What do you think about something like this (not even compile tested):
> ---
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index bd9718824..0cf492a3b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -591,6 +591,8 @@ struct polled_queue {
>      struct dp_netdev_rxq *rxq;
>      odp_port_t port_no;
>      bool emc_enabled;
> +    bool enabled;
> +    uint64_t change_seq;
>  };
>
>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>          poll_list[i].rxq = poll->rxq;
>          poll_list[i].port_no = poll->rxq->port->port_no;
>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
> +        poll_list[i].change_seq =
> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>          i++;
>      }
>
> @@ -5440,6 +5445,10 @@ reload:
>
>          for (i = 0; i < poll_cnt; i++) {
>
> +            if (!poll_list[i].enabled) {
> +                continue;
> +            }
> +
>              if (poll_list[i].emc_enabled) {
>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>                                      &pmd->ctx.emc_insert_min);
> @@ -5476,6 +5485,15 @@ reload:
>              if (reload) {
>                  break;
>              }
> +
> +            for (i = 0; i < poll_cnt; i++) {
> +                uint64_t current_seq =
> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
> +                if (poll_list[i].change_seq != current_seq) {
> +                    poll_list[i].change_seq = current_seq;
> +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
> +                }
> +            }
>          }
>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>                                 pmd_perf_metrics_enabled(pmd));
> ---
> + replacing of your 'netdev_request_reconfigure(&dev->up)' inside
> 'vring_state_changed'
> with 'netdev_change_seq_changed(&dev->up)'?
>

Interesting. I would prefer a bitmap of rxqs rather than looping over
all of them and checking the state.
But let me try your approach first.
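For reference, the bitmap alternative mentioned above could look roughly like the sketch below. All names here are hypothetical, not the actual dpif-netdev symbols; the idea is just one bit per poll_list entry so the hot loop can test and skip disabled queues without touching each element (assuming at most 64 queues per PMD thread):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-PMD bitmap of enabled rx queues: bit i set means
 * poll_list[i] is enabled.  Assumes poll_cnt <= 64. */
struct pmd_sketch {
    uint64_t enabled_map;
    int poll_cnt;
};

static inline void
sketch_set_enabled(struct pmd_sketch *pmd, int i, bool enabled)
{
    if (enabled) {
        pmd->enabled_map |= UINT64_C(1) << i;
    } else {
        pmd->enabled_map &= ~(UINT64_C(1) << i);
    }
}

static inline bool
sketch_is_enabled(const struct pmd_sketch *pmd, int i)
{
    return (pmd->enabled_map >> i) & 1;
}
```

With this layout the hot loop could also iterate only over set bits (e.g. with a count-trailing-zeros loop) instead of testing every entry, which is where a bitmap could in principle beat a per-entry bool.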



> This should help us avoid reconfiguring/rescheduling everything and avoid
> wasting much time on polling. It will also allow reacting faster to queue
> state changes.
>

I suspect the additional branch could still have an impact, but I won't
know before trying.



> After that, we'll be able to fix reconfiguration on F_MQ manipulations
> with something simple like this:
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -3492,8 +3528,8 @@ new_device(int vid)
>                  newnode = dev->socket_id;
>              }
>
> -            if (dev->requested_n_txq != qp_num
> -                || dev->requested_n_rxq != qp_num
> +            if (dev->requested_n_txq < qp_num
> +                || dev->requested_n_rxq < qp_num
>                  || dev->requested_socket_id != newnode) {
>                  dev->requested_socket_id = newnode;
>                  dev->requested_n_rxq = qp_num;
> ---
> Decreasing 'qp_num' will not cause big issues because we'll not poll
> disabled queues.
>

Yes.



> And there could be one separate change to drop the requested values if the
> socket connection is closed (I hope that the guest is not able to control
> that):
>

I don't think the guest can control this part.


---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -185,13 +185,16 @@ static const struct rte_eth_conf port_conf = {
>   */
>  static int new_device(int vid);
>  static void destroy_device(int vid);
> +static void destroy_connection(int vid);
>  static int vring_state_changed(int vid, uint16_t queue_id, int enable);
>  static const struct vhost_device_ops virtio_net_device_ops =
>  {
>      .new_device =  new_device,
>      .destroy_device = destroy_device,
>      .vring_state_changed = vring_state_changed,
> -    .features_changed = NULL
> +    .features_changed = NULL,
> +    .new_connection = NULL,
> +    .destroy_connection = destroy_connection
>  };
>
>  enum { DPDK_RING_SIZE = 256 };
> @@ -3462,6 +3465,39 @@ netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
>      free(enabled_queues);
>  }
>
> +/*
> + * vhost-user socket connection is closed.
> + */
> +static void
> +destroy_connection(int vid)
> +{
> +    struct netdev_dpdk *dev;
> +    char ifname[IF_NAME_SZ];
> +
> +    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    LIST_FOR_EACH(dev, list_node, &dpdk_list) {
> +        ovs_mutex_lock(&dev->mutex);
> +        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> +            uint32_t qp_num = NR_QUEUE;
> +
> +            /* Restore the number of queue pairs to default. */
> +            if (dev->requested_n_txq != qp_num
> +                || dev->requested_n_rxq != qp_num) {
> +                dev->requested_n_rxq = qp_num;
> +                dev->requested_n_txq = qp_num;
> +                netdev_request_reconfigure(&dev->up);
> +            }
> +            ovs_mutex_unlock(&dev->mutex);
> +            break;
> +        }
> +        ovs_mutex_unlock(&dev->mutex);
> +    }
> +    ovs_mutex_unlock(&dpdk_mutex);
> +}
> +
> +
>  /*
>   * A new virtio-net device is added to a vhost port.
>   */
>
>
Ilya Maximets April 10, 2019, 7:59 a.m. | #7
On 09.04.2019 17:49, Kevin Traynor wrote:
> On 09/04/2019 12:41, Ilya Maximets wrote:
>> On 09.04.2019 11:34, Ilya Maximets wrote:
>>> On 08.04.2019 16:44, David Marchand wrote:
>>>> Hello Ilya,
>>>>
>>>> On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets@samsung.com <mailto:i.maximets@samsung.com>> wrote:
>>>>
>>>>     On 04.04.2019 22:49, David Marchand wrote:
>>>>     > We tried to lower the number of rebalances but we don't have a
>>>>     > satisfying solution at the moment, so this patch rebalances on each
>>>>     > update.
>>>>
>>>>     Hi.
>>>>
>>>>     Triggering the reconfiguration on each vring state change is a bad thing.
>>>>     This could be abused by the guest to break the host networking by infinite
>>>>     disabling/enabling queues. Each reconfiguration leads to removing ports
>>>>     from the PMD port caches and their reloads. On rescheduling all the ports
>>>>
>>>>
>>>> I'd say the reconfiguration itself is not wanted here.
>>>> Only rebalancing the queues would be enough.
>>>
>>> As you correctly mentioned in the commit message, there will be no real port
>>> configuration changes.
>>>
>>> By 'reconfiguration' I mean datapath reconfiguration, and 'rebalancing'
>>> is one of its stages.
>>>
>>>>
>>>>
>>>>     could be moved to a different PMD threads resulting in EMC/SMC/dpcls
>>>>     invalidation and subsequent upcalls/packet reorderings.
>>>>
>>>>
>>>> I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when moving queues.
>>>> However, EMC/SMC/dpcls are per pmd specific, where would we have packet reordering ?
>>>
>>> An Rx queue could be scheduled to a different PMD thread --> new packets will go to a
>>> different Tx queue. It's unlikely, but which packets will be sent first will depend on the
>>> device/driver. The main issue here is that it happens to other ports, not only to the port
>>> we're trying to reconfigure.
>>>
>>>>
>>>>
>>>>
>>>>     Same issues was discussed previously while looking at possibility of
>>>>     vhost-pmd integration (with some test results):
>>>>     https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
>>>>
>>>>
>>>> Thanks for the link, I will test this.
>>>>
>>>>
>>>>
>>>>     One more reference:
>>>>     7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")
>>>>
>>>>
>>>> Yes, I saw this patch.
>>>> Are we safe against guest drivers/applications that play with VIRTIO_NET_F_MQ, swapping it continuously?
>>>
>>> Good point. I didn't test that, but it looks like we're not safe here.
>>> Kernel and DPDK drivers have F_MQ enabled by default, so it'll require
>>> some changes/explicit disabling. But I agree that this could produce
>>> issues if someone does that.
>>>
>>> We could probably handle this using a 'max seen qp_num' approach. But I'm
>>> not a fan of it. We need to figure out how to do this correctly.
>>>
>>> In general, I think that we should not allow cases where the guest is able
>>> to manipulate the host configuration.
>>>
>>>>
>>>>
>>>>
>>>>
>>>>     Anyway, do you have some numbers of how much time PMD thread spends on polling
>>>>     disabled queues? What the performance improvement you're able to achieve by
>>>>     avoiding that?
>>>>
>>>>
>>>> With a simple pvp setup of mine.
>>>> 1c/2t poll two physical ports.
>>>> 1c/2t poll four vhost ports with 16 queues each.
>>>>   Only one queue is enabled on each virtio device attached by the guest.
>>>>   The first two virtio devices are bound to the virtio kmod.
>>>>   The last two virtio devices are bound to vfio-pci and used to forward incoming traffic with testpmd.
>>>>
>>>> The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues) to 6.2Mpps (polling only the 4 enabled vhost queues).
>>>
>>> That's interesting. However, this doesn't look like a realistic scenario.
>>> In practice you'll need many more PMD threads to handle so many queues.
>>> If you add more threads, a zeroloss test could show even worse results
>>> if one of the idle VMs periodically changes the number of queues. Periodic
>>> latency spikes will cause queue overruns and subsequent packet drops on
>>> hot Rx queues. This could be partially solved by allowing n_rxq to grow only.
>>> However, I'd be happy to have a different solution that does not hide the
>>> number of queues from the datapath.
>>
>> What do you think about something like this (not even compile tested):
>> ---
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index bd9718824..0cf492a3b 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -591,6 +591,8 @@ struct polled_queue {
>>      struct dp_netdev_rxq *rxq;
>>      odp_port_t port_no;
>>      bool emc_enabled;
>> +    bool enabled;
>> +    uint64_t change_seq;
>>  };
>>  
>>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
>> @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>>          poll_list[i].rxq = poll->rxq;
>>          poll_list[i].port_no = poll->rxq->port->port_no;
>>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
>> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
>> +        poll_list[i].change_seq =
>> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>>          i++;
>>      }
>>  
>> @@ -5440,6 +5445,10 @@ reload:
>>  
>>          for (i = 0; i < poll_cnt; i++) {
>>  
>> +            if (!poll_list[i].enabled) {
>> +                continue;
>> +            }
>> +
> 
> I had a similar thought that doing something like this might be a good
> compromise, but I wonder about the possible delays in polling newly
> enabled queues.
> 
> The state could just be checked directly here,
> 
> +            if (!netdev_rxq_enabled(poll_list[i].rxq)) {
> +                    continue;
> +            }
> 
> or, if that is too costly, then only check disabled queues to catch the
> disable->enable case early. DPDK will handle the enable->disable case, so we
> could avoid the check here and do a slower update when lc > 1024.
> 
> +            if (!poll_list[i].enabled) {
> +                if (OVS_LIKELY(!netdev_rxq_enabled(poll_list[i].rxq)))
> +                    continue;
> +                else {
> +                    poll_list[i].enabled = true;
> +                }
> +            }
> 
> Of course, if these checks cost almost as much as letting DPDK do the checks
> in the first place, then they are not worth it.

The case where a VM application quickly enables and disables queues at runtime
should not be usual. Most of the enables should happen during initial virtio
device configuration, and I think that 1024 iterations is a small enough time
for the PMD thread to adjust the configuration. Of course, we could test this.
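The pattern under discussion (a cheap cached-flag test in the hot loop, plus a slow-path refresh of the cache every ~1024 iterations) can be sketched as follows. The names and structures here are illustrative stand-ins, not the actual dpif-netdev code, and the refresh trigger is simplified to a modulo check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REFRESH_PERIOD 1024

/* Stand-in for the queue state reported by the vhost backend. */
struct rxq_sketch {
    bool hw_enabled;
};

struct polled_sketch {
    struct rxq_sketch *rxq;
    bool enabled;        /* cached copy checked in the hot loop */
};

/* One "PMD iteration": poll only cached-enabled queues, and refresh
 * the cache from the backend state once every REFRESH_PERIOD
 * iterations.  Returns the number of queues polled. */
static int
pmd_iteration(struct polled_sketch *list, int cnt, uint64_t lc)
{
    int polled = 0;

    for (int i = 0; i < cnt; i++) {
        if (!list[i].enabled) {
            continue;            /* cheap skip on the hot path */
        }
        polled++;                /* stands in for netdev_rxq_recv() */
    }

    if (lc % REFRESH_PERIOD == 0) {
        /* Slow path: resync the cached flags with the backend. */
        for (int i = 0; i < cnt; i++) {
            list[i].enabled = list[i].rxq->hw_enabled;
        }
    }
    return polled;
}
```

The worst case for a newly enabled queue is one refresh period of extra latency before polling starts, which is the trade-off being weighed above against re-checking the backend state on every iteration.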

> 
>>              if (poll_list[i].emc_enabled) {
>>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>>                                      &pmd->ctx.emc_insert_min);
>> @@ -5476,6 +5485,15 @@ reload:
>>              if (reload) {
>>                  break;
>>              }
>> +
>> +            for (i = 0; i < poll_cnt; i++) {
>> +                uint64_t current_seq =
>> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
>> +                if (poll_list[i].change_seq != current_seq) {
>> +                    poll_list[i].change_seq = current_seq;
>> +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
>> +                }
>> +            }
>>          }
>>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>>                                 pmd_perf_metrics_enabled(pmd));
>> ---
>> + replacing of your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
>> with 'netdev_change_seq_changed(&dev->up)'?
>>
>> This should help to not reconfigure/reschedule everything and not waste much time on
>> polling. Also this will give ability to faster react on queue state changes.
>>
>> After that we'll be able to fix reconfiguration on F_MQ manipulations with something
>> simple like that:
>> ---
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 91af377e8..1937561cc 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -3492,8 +3528,8 @@ new_device(int vid)
>>                  newnode = dev->socket_id;
>>              }
>>  
>> -            if (dev->requested_n_txq != qp_num
>> -                || dev->requested_n_rxq != qp_num
>> +            if (dev->requested_n_txq < qp_num
>> +                || dev->requested_n_rxq < qp_num
>>                  || dev->requested_socket_id != newnode) {
>>                  dev->requested_socket_id = newnode;
>>                  dev->requested_n_rxq = qp_num;
>> ---
>> Decreasing of 'qp_num' will not cause big issues because we'll not poll
>> disabled queues.
>>
>> And there could be one separate change to drop the requested values if socket connection
>> closed (I hope that guest is not able to control that):
>> ---
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 91af377e8..1937561cc 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -185,13 +185,16 @@ static const struct rte_eth_conf port_conf = {
>>   */
>>  static int new_device(int vid);
>>  static void destroy_device(int vid);
>> +static void destroy_connection(int vid);
>>  static int vring_state_changed(int vid, uint16_t queue_id, int enable);
>>  static const struct vhost_device_ops virtio_net_device_ops =
>>  {
>>      .new_device =  new_device,
>>      .destroy_device = destroy_device,
>>      .vring_state_changed = vring_state_changed,
>> -    .features_changed = NULL
>> +    .features_changed = NULL,
>> +    .new_connection = NULL,
>> +    .destroy_connection = destroy_connection
>>  };
>>  
>>  enum { DPDK_RING_SIZE = 256 };
>> @@ -3462,6 +3465,39 @@ netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
>>      free(enabled_queues);
>>  }
>>  
>> +/*
>> + * vhost-user socket connection is closed.
>> + */
>> +static void
>> +destroy_connection(int vid)
>> +{
>> +    struct netdev_dpdk *dev;
>> +    char ifname[IF_NAME_SZ];
>> +
>> +    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    LIST_FOR_EACH(dev, list_node, &dpdk_list) {
>> +        ovs_mutex_lock(&dev->mutex);
>> +        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
>> +            uint32_t qp_num = NR_QUEUE;
>> +
>> +            /* Restore the number of queue pairs to default. */
>> +            if (dev->requested_n_txq != qp_num
>> +                || dev->requested_n_rxq != qp_num) {
>> +                dev->requested_n_rxq = qp_num;
>> +                dev->requested_n_txq = qp_num;
>> +                netdev_request_reconfigure(&dev->up);
>> +            }
>> +            ovs_mutex_unlock(&dev->mutex);
>> +            break;
>> +        }
>> +        ovs_mutex_unlock(&dev->mutex);
>> +    }
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +}
>> +
>> +
>>  /*
>>   * A new virtio-net device is added to a vhost port.
>>   */
>> ---
>>
>> Best regards, Ilya Maximets.
>>
> 
> 
>
Jan Scheurich April 10, 2019, 1:12 p.m. | #8
Hi Ilya,

> >
> > With a simple pvp setup of mine.
> > 1c/2t poll two physical ports.
> > 1c/2t poll four vhost ports with 16 queues each.
> >   Only one queue is enabled on each virtio device attached by the guest.
> >   The first two virtio devices are bound to the virtio kmod.
> >   The last two virtio devices are bound to vfio-pci and used to forward
> incoming traffic with testpmd.
> >
> > The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues)
> to 6.2Mpps (polling only the 4 enabled vhost queues).
> 
> That's interesting. However, this doesn't look like a realistic scenario.
> In practice you'll need much more PMD threads to handle so many queues.
> If you'll add more threads, zeroloss test could show even worse results if one of
> idle VMs will periodically change the number of queues. Periodic latency spikes
> will cause queue overruns and subsequent packet drops on hot Rx queues. This
> could be partially solved by allowing n_rxq to grow only.
> However, I'd be happy to have different solution that will not hide number of
> queues from the datapath.
> 

I am afraid it is not a valid assumption that there will be nearly as many OVS PMD threads as there are queues.

In OpenStack deployments the OVS is typically statically configured to use a few dedicated host CPUs for PMDs (perhaps 2-8).

Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or even more). If they enable an instance for multi-queue in Nova, Nova (in its eternal wisdom) will set up every vhostuser port with #vCPU queue pairs. A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs, even if only one or two high-traffic ports can actually profit from multi-queue. Even on those ports it is unlikely that the application will use all 16 queues. And often there would be another such VM on the second NUMA node.

So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast number of unused queue pairs in OVS, and it makes a lot of sense to minimize the run-time impact of having these around.

We have had discussion earlier with RedHat as to how a vhostuser backend like OVS could negotiate the number of queue pairs with Qemu down to a reasonable value (e.g. the number PMDs available for polling) *before* Qemu would actually start the guest. The guest would then not have to guess on the optimal number of queue pairs to actually activate.

BR, Jan
Ilya Maximets April 10, 2019, 1:41 p.m. | #9
On 10.04.2019 16:12, Jan Scheurich wrote:
> Hi Ilya,
> 
>>>
>>> With a simple pvp setup of mine.
>>> 1c/2t poll two physical ports.
>>> 1c/2t poll four vhost ports with 16 queues each.
>>>   Only one queue is enabled on each virtio device attached by the guest.
>>>   The first two virtio devices are bound to the virtio kmod.
>>>   The last two virtio devices are bound to vfio-pci and used to forward
>> incoming traffic with testpmd.
>>>
>>> The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues)
>> to 6.2Mpps (polling only the 4 enabled vhost queues).
>>
>> That's interesting. However, this doesn't look like a realistic scenario.
>> In practice you'll need much more PMD threads to handle so many queues.
>> If you'll add more threads, zeroloss test could show even worse results if one of
>> idle VMs will periodically change the number of queues. Periodic latency spikes
>> will cause queue overruns and subsequent packet drops on hot Rx queues. This
>> could be partially solved by allowing n_rxq to grow only.
>> However, I'd be happy to have different solution that will not hide number of
>> queues from the datapath.
>>
> 
> I am afraid it is not a valid assumption that there will be similarly large number of OVS PMD threads as there are queues. 
> 
> In OpenStack deployments the OVS is typically statically configured to use a few dedicated host CPUs for PMDs (perhaps 2-8).
> 
> Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or even more). If they enable an instance for multi-queue in Nova, Nova (in its eternal wisdom) will set up every vhostuser port with #vCPU queue pairs.

For me, it's a Nova issue. It's pretty easy to limit the maximum number of queue pairs
to some sane value (a value that can be handled by your number of available PMD threads).
It would be one config option and a small patch to nova-compute. With a bit more work you
could make this per-port configurable and finally stop wasting HW resources.

> A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs, even if only one or two high-traffic ports can actually profit from multi-queue. Even on those ports is it unlikely that the application will use all 16 queues. And often there would be another such VM on the second NUMA node.

By limiting the number of queues in Nova (as described above) to 4, you'll have just
24 queues for 6 ports. If you make it per-port, you'll be able to limit this to even
saner values.

> 
> So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast number of un-used queue pairs in OVS and it makes a lot of sense to minimize the run-time impact of having these around. 

To me, this does not seem like an OVS, DPDK, or QEMU issue. The orchestrator should configure sane
values in the first place. It's totally unclear why we're changing OVS instead of changing Nova.

> 
> We have had discussion earlier with RedHat as to how a vhostuser backend like OVS could negotiate the number of queue pairs with Qemu down to a reasonable value (e.g. the number PMDs available for polling) *before* Qemu would actually start the guest. The guest would then not have to guess on the optimal number of queue pairs to actually activate.
> 
> BR, Jan
>
David Marchand April 10, 2019, 2:10 p.m. | #10
On Wed, Apr 10, 2019 at 9:32 AM David Marchand <david.marchand@redhat.com>
wrote:

> On Tue, Apr 9, 2019 at 1:42 PM Ilya Maximets <i.maximets@samsung.com>
> wrote:
>
>>
>> What do you think about something like this (not even compile tested):
>> ---
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index bd9718824..0cf492a3b 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -591,6 +591,8 @@ struct polled_queue {
>>      struct dp_netdev_rxq *rxq;
>>      odp_port_t port_no;
>>      bool emc_enabled;
>> +    bool enabled;
>> +    uint64_t change_seq;
>>  };
>>
>>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
>> @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>>          poll_list[i].rxq = poll->rxq;
>>          poll_list[i].port_no = poll->rxq->port->port_no;
>>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
>> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
>> +        poll_list[i].change_seq =
>> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>>          i++;
>>      }
>>
>> @@ -5440,6 +5445,10 @@ reload:
>>
>>          for (i = 0; i < poll_cnt; i++) {
>>
>> +            if (!poll_list[i].enabled) {
>> +                continue;
>> +            }
>> +
>>              if (poll_list[i].emc_enabled) {
>>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>>                                      &pmd->ctx.emc_insert_min);
>> @@ -5476,6 +5485,15 @@ reload:
>>              if (reload) {
>>                  break;
>>              }
>> +
>> +            for (i = 0; i < poll_cnt; i++) {
>> +                uint64_t current_seq =
>> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
>> +                if (poll_list[i].change_seq != current_seq) {
>> +                    poll_list[i].change_seq = current_seq;
>> +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
>> +                }
>> +            }
>>          }
>>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>>                                 pmd_perf_metrics_enabled(pmd));
>> ---
>> + replacing of your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
>> with 'netdev_change_seq_changed(&dev->up)'?
>>
>
> Interesting, I would prefer to have a bitmap of rxq rather than looping on
> all and checking the state.
> But let me try your approach first.
>

I incorporated this change, and it works fine; I get similar numbers
with and without a bitmap, so I dropped the bitmap.
I would say we are only missing some information in
dpif-netdev/pmd-rxq-show to get a per-rxq status.

I will submit this with the fix on F_MQ if this is okay for you.

Thanks!
Ilya Maximets April 10, 2019, 2:24 p.m. | #11
On 10.04.2019 17:10, David Marchand wrote:
> On Wed, Apr 10, 2019 at 9:32 AM David Marchand <david.marchand@redhat.com <mailto:david.marchand@redhat.com>> wrote:
> 
>     On Tue, Apr 9, 2019 at 1:42 PM Ilya Maximets <i.maximets@samsung.com <mailto:i.maximets@samsung.com>> wrote:
> 
> 
>         What do you think about something like this (not even compile tested):
>         ---
>         diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>         index bd9718824..0cf492a3b 100644
>         --- a/lib/dpif-netdev.c
>         +++ b/lib/dpif-netdev.c
>         @@ -591,6 +591,8 @@ struct polled_queue {
>              struct dp_netdev_rxq *rxq;
>              odp_port_t port_no;
>              bool emc_enabled;
>         +    bool enabled;
>         +    uint64_t change_seq;
>          };
> 
>          /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
>         @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>                  poll_list[i].rxq = poll->rxq;
>                  poll_list[i].port_no = poll->rxq->port->port_no;
>                  poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
>         +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
>         +        poll_list[i].change_seq =
>         +                     netdev_get_change_seq(poll->rxq->port->netdev);
>                  i++;
>              }
> 
>         @@ -5440,6 +5445,10 @@ reload:
> 
>                  for (i = 0; i < poll_cnt; i++) {
> 
>         +            if (!poll_list[i].enabled) {
>         +                continue;
>         +            }
>         +
>                      if (poll_list[i].emc_enabled) {
>                          atomic_read_relaxed(&pmd->dp->emc_insert_min,
>                                              &pmd->ctx.emc_insert_min);
>         @@ -5476,6 +5485,15 @@ reload:
>                      if (reload) {
>                          break;
>                      }
>         +
>         +            for (i = 0; i < poll_cnt; i++) {
>         +                uint64_t current_seq =
>         +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
>         +                if (poll_list[i].change_seq != current_seq) {
>         +                    poll_list[i].change_seq = current_seq;
>         +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
>         +                }
>         +            }
>                  }
>                  pmd_perf_end_iteration(s, rx_packets, tx_packets,
>                                         pmd_perf_metrics_enabled(pmd));
>         ---
>         + replacing of your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
>         with 'netdev_change_seq_changed(&dev->up)'?
> 
> 
>     Interesting, I would prefer to have a bitmap of rxq rather than looping on all and checking the state.
>     But let me try your approach first.
> 
> 
> Incorporated this change, and it works fine, I get similar numbers with/without a bitmap, so I dropped the bitmap.
> I would say we are only missing some information in dpif-netdev/pmd-rxq-show to get a per rxq status.
> 
> I will submit this with the fix on F_MQ if this is okay for you.

Sure.
I'll make a more detailed review and, probably, some tests after that.

> 
> Thanks!
> 
> -- 
> David Marchand
Jan Scheurich April 10, 2019, 3:12 p.m. | #12
> >
> > I am afraid it is not a valid assumption that there will be similarly large
> number of OVS PMD threads as there are queues.
> >
> > In OpenStack deployments the OVS is typically statically configured to use a
> few dedicated host CPUs for PMDs (perhaps 2-8).
> >
> > Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or
> even more). If they enable an instance for multi-queue in Nova, Nova (in its
> eternal wisdom) will set up every vhostuser port with #vCPU queue pairs.
> 
> For me, it's an issue of Nova. It's pretty easy to limit the maximum number of
> queue pairs to some sane value (the value that could be handled by your
> number of available PMD threads).
> It'll be a one config and a small patch to nova-compute. With a bit more work
> you could make this per-port configurable and finally stop wasting HW
> resources.

OK, I fully agree. 
The OpenStack community is slow, though, when it comes to these kinds of changes.
Do we have contacts we could push?

> 
> > A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs,
> even if only one or two high-traffic ports can actually profit from multi-queue.
> Even on those ports is it unlikely that the application will use all 16 queues. And
> often there would be another such VM on the second NUMA node.
> 
> With limiting the number of queues in Nova (like I described above) to 4 you'll
> have just
> 24 queues for 6 ports. If you'll make it per-port, you'll be able to limit this to
> even more sane values.

Yes, per-port configuration in Neutron seems the logical thing to do,
rather than a global per-instance parameter in the Nova flavor. A per-server
setting in Nova compute to limit the number of acceptable queue pairs to
match the OVS configuration might still be useful on top.

> 
> >
> > So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast
> number of un-used queue pairs in OVS and it makes a lot of sense to minimize
> the run-time impact of having these around.
> 
> For me it seems like not an OVS, DPDK or QEMU issue. The orchestrator should
> configure sane values first of all. It's totally unclear why we're changing OVS
> instead of changing Nova.

The VNF orchestrator would request queues based on the application's needs. It
should not need to be aware of the configuration of the infrastructure (such as
the number of PMD threads in OVS). The OpenStack operator would have to make
sure that the instantiated queues are a good compromise between application
needs and infra capabilities.

> 
> >
> > We have had discussion earlier with RedHat as to how a vhostuser backend
> like OVS could negotiate the number of queue pairs with Qemu down to a
> reasonable value (e.g. the number PMDs available for polling) *before* Qemu
> would actually start the guest. The guest would then not have to guess on the
> optimal number of queue pairs to actually activate.
> >
> > BR, Jan
> >
David Marchand April 11, 2019, 2:03 p.m. | #13
On Wed, Apr 10, 2019 at 4:24 PM Ilya Maximets <i.maximets@samsung.com>
wrote:

> On 10.04.2019 17:10, David Marchand wrote:
> > Incorporated this change, and it works fine, I get similar numbers
> with/without a bitmap, so I dropped the bitmap.
> > I would say we are only missing some information in
> dpif-netdev/pmd-rxq-show to get a per rxq status.
> >
> > I will submit this with the fix on F_MQ if this is okay for you.
>
> Sure.
> I'll make a more detailed review and, probably, some tests after that.
>
>
Sent in the series:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=102214
Thanks.

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 481ef50..7b34ccc 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4546,6 +4546,10 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
         for (int qid = 0; qid < port->n_rxq; qid++) {
             struct dp_netdev_rxq *q = &port->rxqs[qid];
 
+            /* skip disabled rxq */
+            if (!netdev_rxq_enabled(q->rx)) {
+                continue;
+            }
             if (pinned && q->core_id != OVS_CORE_UNSPEC) {
                 struct dp_netdev_pmd_thread *pmd;
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 4bf0ca9..8f8fd1a 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -326,6 +326,10 @@  struct dpdk_mp {
      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
  };
 
+struct dpdk_rx_queue {
+    bool enabled;
+};
+
 /* There should be one 'struct dpdk_tx_queue' created for
  * each cpu core. */
 struct dpdk_tx_queue {
@@ -433,6 +437,8 @@  struct netdev_dpdk {
         OVSRCU_TYPE(struct ingress_policer *) ingress_policer;
         uint32_t policer_rate;
         uint32_t policer_burst;
+
+        struct dpdk_rx_queue *rx_q;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -1119,6 +1125,12 @@  netdev_dpdk_alloc(void)
     return NULL;
 }
 
+static struct dpdk_rx_queue *
+netdev_dpdk_alloc_rxq(unsigned int n_rxqs)
+{
+    return dpdk_rte_mzalloc(n_rxqs * sizeof(struct dpdk_rx_queue));
+}
+
 static struct dpdk_tx_queue *
 netdev_dpdk_alloc_txq(unsigned int n_txqs)
 {
@@ -1245,6 +1257,10 @@  vhost_common_construct(struct netdev *netdev)
     int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
+    dev->rx_q = netdev_dpdk_alloc_rxq(OVS_VHOST_MAX_QUEUE_NUM);
+    if (!dev->rx_q) {
+        return ENOMEM;
+    }
     dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
     if (!dev->tx_q) {
         return ENOMEM;
@@ -1360,6 +1376,7 @@  common_destruct(struct netdev_dpdk *dev)
     OVS_REQUIRES(dpdk_mutex)
     OVS_EXCLUDED(dev->mutex)
 {
+    rte_free(dev->rx_q);
     rte_free(dev->tx_q);
     dpdk_mp_put(dev->dpdk_mp);
 
@@ -2205,6 +2222,14 @@  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 }
 
 static int
+netdev_dpdk_vhost_rxq_enabled(struct netdev_rxq *rxq)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
+
+    return dev->rx_q[rxq->queue_id].enabled;
+}
+
+static int
 netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
                      int *qfill)
 {
@@ -3527,6 +3552,18 @@  new_device(int vid)
 
-/* Clears mapping for all available queues of vhost interface. */
+/* Marks all rx queues of the vhost interface as disabled. */
 static void
+netdev_dpdk_rxq_map_clear(struct netdev_dpdk *dev)
+    OVS_REQUIRES(dev->mutex)
+{
+    int i;
+
+    for (i = 0; i < dev->up.n_rxq; i++) {
+        dev->rx_q[i].enabled = false;
+    }
+}
+
+/* Clears mapping for all available queues of vhost interface. */
+static void
 netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
     OVS_REQUIRES(dev->mutex)
 {
@@ -3559,6 +3595,7 @@  destroy_device(int vid)
             ovs_mutex_lock(&dev->mutex);
             dev->vhost_reconfigured = false;
             ovsrcu_index_set(&dev->vid, -1);
+            netdev_dpdk_rxq_map_clear(dev);
             netdev_dpdk_txq_map_clear(dev);
 
             netdev_change_seq_changed(&dev->up);
@@ -3593,24 +3630,30 @@  vring_state_changed(int vid, uint16_t queue_id, int enable)
     struct netdev_dpdk *dev;
     bool exists = false;
     int qid = queue_id / VIRTIO_QNUM;
+    bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
     char ifname[IF_NAME_SZ];
 
     rte_vhost_get_ifname(vid, ifname, sizeof ifname);
 
-    if (queue_id % VIRTIO_QNUM == VIRTIO_TXQ) {
-        return 0;
-    }
-
     ovs_mutex_lock(&dpdk_mutex);
     LIST_FOR_EACH (dev, list_node, &dpdk_list) {
         ovs_mutex_lock(&dev->mutex);
         if (strncmp(ifname, dev->vhost_id, IF_NAME_SZ) == 0) {
-            if (enable) {
-                dev->tx_q[qid].map = qid;
+            if (is_rx) {
+                bool old_enabled = dev->rx_q[qid].enabled;
+
+                dev->rx_q[qid].enabled = enable != 0;
+                if (old_enabled != dev->rx_q[qid].enabled) {
+                    netdev_request_reconfigure(&dev->up);
+                }
             } else {
-                dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
+                if (enable) {
+                    dev->tx_q[qid].map = qid;
+                } else {
+                    dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
+                }
+                netdev_dpdk_remap_txqs(dev);
             }
-            netdev_dpdk_remap_txqs(dev);
             exists = true;
             ovs_mutex_unlock(&dev->mutex);
             break;
@@ -3620,9 +3663,9 @@  vring_state_changed(int vid, uint16_t queue_id, int enable)
     ovs_mutex_unlock(&dpdk_mutex);
 
     if (exists) {
-        VLOG_INFO("State of queue %d ( tx_qid %d ) of vhost device '%s'"
-                  "changed to \'%s\'", queue_id, qid, ifname,
-                  (enable == 1) ? "enabled" : "disabled");
+        VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s' "
+                  "changed to \'%s\'", queue_id, is_rx ? "rx" : "tx",
+                  qid, ifname, enable ? "enabled" : "disabled");
     } else {
         VLOG_INFO("vHost Device '%s' not found", ifname);
         return -1;
@@ -5014,7 +5057,8 @@  static const struct netdev_class dpdk_vhost_class = {
     .get_stats = netdev_dpdk_vhost_get_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_reconfigure,
-    .rxq_recv = netdev_dpdk_vhost_rxq_recv
+    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
+    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
 };
 
 static const struct netdev_class dpdk_vhost_client_class = {
@@ -5028,7 +5072,8 @@  static const struct netdev_class dpdk_vhost_client_class = {
     .get_stats = netdev_dpdk_vhost_get_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_client_reconfigure,
-    .rxq_recv = netdev_dpdk_vhost_rxq_recv
+    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
+    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
 };
 
 void
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e..39354db 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -789,6 +789,11 @@  struct netdev_class {
     void (*rxq_destruct)(struct netdev_rxq *);
     void (*rxq_dealloc)(struct netdev_rxq *);
 
+    /* A netdev can report whether a queue will not receive any traffic and
+     * should be excluded from polling.  Not implementing this callback
+     * implicitly means that all queues are enabled. */
+    int (*rxq_enabled)(struct netdev_rxq *);
+
     /* Attempts to receive a batch of packets from 'rx'.  In 'batch', the
      * caller supplies 'packets' as the pointer to the beginning of an array
      * of NETDEV_MAX_BURST pointers to dp_packet.  If successful, the
diff --git a/lib/netdev.c b/lib/netdev.c
index 45b50f2..e0ef21f 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -683,6 +683,17 @@  netdev_rxq_close(struct netdev_rxq *rx)
     }
 }
 
+int
+netdev_rxq_enabled(struct netdev_rxq *rx)
+{
+    bool enabled = true;
+
+    if (rx->netdev->netdev_class->rxq_enabled) {
+        enabled = rx->netdev->netdev_class->rxq_enabled(rx);
+    }
+    return enabled;
+}
+
 /* Attempts to receive a batch of packets from 'rx'.  'batch' should point to
  * the beginning of an array of NETDEV_MAX_BURST pointers to dp_packet.  If
  * successful, this function stores pointers to up to NETDEV_MAX_BURST
diff --git a/lib/netdev.h b/lib/netdev.h
index d94817f..859f5ef 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -183,6 +183,7 @@  enum netdev_pt_mode netdev_get_pt_mode(const struct netdev *);
 /* Packet reception. */
 int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
 void netdev_rxq_close(struct netdev_rxq *);
+int netdev_rxq_enabled(struct netdev_rxq *);
 
 const char *netdev_rxq_get_name(const struct netdev_rxq *);
 int netdev_rxq_get_queue_id(const struct netdev_rxq *);