
[ovs-dev,1/3] dpif-netdev: only poll enabled vhost queues

Message ID 1554991244-26307-1-git-send-email-david.marchand@redhat.com
State Changes Requested

Commit Message

David Marchand April 11, 2019, 2 p.m. UTC
We currently poll all available queues based on the max queue count
exchanged with the vhost peer and rely on the vhost library in DPDK to
check the vring status beneath.
This can lead to some overhead when we have a lot of unused queues.

To enhance the situation, we can skip the disabled queues.
On rxq notifications, we make use of the netdev's change_seq number so
that the pmd thread main loop can cache the queue state periodically.
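The seq-based caching described above can be sketched roughly as follows (a minimal stand-alone illustration; the types and names are stand-ins, not the real OVS `struct netdev` / `struct polled_queue`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the change_seq caching: the pmd main loop
 * re-reads a queue's enabled flag only when the netdev's change sequence
 * number has moved, so the steady-state cost per iteration is a single
 * integer comparison. */
struct fake_netdev {
    uint64_t change_seq;   /* bumped on vring state notifications */
    bool rxq_enabled;
};

struct fake_polled_queue {
    const struct fake_netdev *netdev;
    uint64_t change_seq;   /* last seq observed by the pmd thread */
    bool enabled;          /* cached queue state */
};

static void
fake_refresh_queue_state(struct fake_polled_queue *pq)
{
    uint64_t current_seq = pq->netdev->change_seq;

    if (pq->change_seq != current_seq) {
        pq->change_seq = current_seq;
        pq->enabled = pq->netdev->rxq_enabled;
    }
}
```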

$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : true
  port: dpdk0             queue-id:  0  pmd usage:  0 %  polling: enabled
pmd thread numa_id 0 core_id 2:
  isolated : true
  port: vhost1            queue-id:  0  pmd usage:  0 %  polling: disabled
  port: vhost3            queue-id:  0  pmd usage:  0 %  polling: disabled
pmd thread numa_id 0 core_id 15:
  isolated : true
  port: dpdk1             queue-id:  0  pmd usage:  0 %  polling: enabled
pmd thread numa_id 0 core_id 16:
  isolated : true
  port: vhost0            queue-id:  0  pmd usage:  0 %  polling: disabled
  port: vhost2            queue-id:  0  pmd usage:  0 %  polling: disabled

$ while true; do
  ovs-appctl dpif-netdev/pmd-rxq-show |awk '
  /polling: / {
    tot++;
    if ($NF == "enabled") {
      en++;
    }
  }
  END {
    print "total: " tot ", enabled: " en
  }'
  sleep 1
done

total: 6, enabled: 2
total: 6, enabled: 2
...

 # Started vm, virtio devices are bound to kernel driver which enables
 # F_MQ + all queue pairs
total: 6, enabled: 2
total: 66, enabled: 66
...

 # Unbound vhost0 and vhost1 from the kernel driver
total: 66, enabled: 66
total: 66, enabled: 34
...

 # Configured kernel bound devices to use only 1 queue pair
total: 66, enabled: 34
total: 66, enabled: 19
total: 66, enabled: 4
...

 # While rebooting the vm
total: 66, enabled: 4
total: 66, enabled: 2
...
total: 66, enabled: 66
...

 # After shutting down the vm
total: 66, enabled: 66
total: 66, enabled: 2

Signed-off-by: David Marchand <david.marchand@redhat.com>
---
 lib/dpif-netdev.c     | 27 ++++++++++++++++++++
 lib/netdev-dpdk.c     | 71 +++++++++++++++++++++++++++++++++++++++++----------
 lib/netdev-provider.h |  5 ++++
 lib/netdev.c          | 10 ++++++++
 lib/netdev.h          |  1 +
 5 files changed, 101 insertions(+), 13 deletions(-)

Comments

David Marchand April 11, 2019, 2:13 p.m. UTC | #1
On Thu, Apr 11, 2019 at 4:02 PM David Marchand <david.marchand@redhat.com>
wrote:

> We currently poll all available queues based on the max queue count
> exchanged with the vhost peer and rely on the vhost library in DPDK to
> check the vring status beneath.
> [...]

This change broke the tests, since they match the exact command output.
Is the output format something we must maintain?
Ilya Maximets April 12, 2019, 10 a.m. UTC | #2
On 11.04.2019 17:13, David Marchand wrote:
> 
> 
> On Thu, Apr 11, 2019 at 4:02 PM David Marchand <david.marchand@redhat.com <mailto:david.marchand@redhat.com>> wrote:
> 
>     We currently poll all available queues based on the max queue count
>     exchanged with the vhost peer and rely on the vhost library in DPDK to
>     check the vring status beneath.
>     [...]
> 
> This change broke the tests, since they match the exact command output.
> Is the output format something we must maintain?

I don't think so.
You just need to fix the tests to match the new output format.
However, in general, the output is already overloaded. Maybe it's
worth printing the polling state only if it's disabled?
This should also save you from changing most of the tests.
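A minimal sketch of this suggestion (buffer handling and the line format are simplified stand-ins for OVS's dynamic string code, not the actual patch): only disabled queues get the extra marker, so the common enabled case keeps the existing line format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Append a "polling: disabled" marker only for disabled queues; enabled
 * queues keep the pre-patch line format, so most existing tests that
 * match the output still pass. */
static void
format_rxq_line(char *buf, size_t len, const char *port, bool enabled)
{
    snprintf(buf, len, "  port: %-16s queue-id:  0%s",
             port, enabled ? "" : "  polling: disabled");
}
```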

> 
> 
> -- 
> David Marchand
Ilya Maximets April 12, 2019, 12:12 p.m. UTC | #3
On 11.04.2019 17:00, David Marchand wrote:
> We currently poll all available queues based on the max queue count
> exchanged with the vhost peer and rely on the vhost library in DPDK to
> check the vring status beneath.
> [...]
> 
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---

Hi.
One important issue:
You need to enable queue #0 in case there were no state changes.
We're doing this for Tx queues in 'dpdk_vhost_reconfigure_helper'.
This is because, if F_PROTOCOL_FEATURES is not negotiated, vrings are
created in the enabled state. This is required to support older QEMU
or legacy drivers.

I'm working on a DPDK patch to trigger 'vring_state_changed' in this
case; however, it could take a long time or never be accepted into LTS.
I'd also prefer not to check for negotiated features in OVS code.
I think this kind of stuff should be hidden from the application by
the vhost library.
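The behavior being described can be sketched in isolation (all names here are illustrative stand-ins, not the actual OVS code): without F_PROTOCOL_FEATURES, no vring state notifications ever arrive, yet the rings start out live, so rx queue 0 must default to enabled, mirroring the existing tx queue 0 handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_QUEUES 8

/* Stand-in for the vhost device state; not the real netdev_dpdk. */
struct fake_vhost_dev {
    bool rxq_enabled[MAX_QUEUES];
    bool protocol_features_negotiated;
};

static void
fake_vhost_reconfigure(struct fake_vhost_dev *dev)
{
    memset(dev->rxq_enabled, 0, sizeof dev->rxq_enabled);
    if (!dev->protocol_features_negotiated) {
        /* Legacy QEMU/driver: no state notifications will come, but the
         * rings are created enabled, so queue 0 must be polled. */
        dev->rxq_enabled[0] = true;
    }
}
```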

More comments inline.

Also, you might want to add Ian to the CC list when sending userspace
datapath related patches.

>  lib/dpif-netdev.c     | 27 ++++++++++++++++++++
>  lib/netdev-dpdk.c     | 71 +++++++++++++++++++++++++++++++++++++++++----------
>  lib/netdev-provider.h |  5 ++++
>  lib/netdev.c          | 10 ++++++++
>  lib/netdev.h          |  1 +
>  5 files changed, 101 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 4d6d0c3..c95612f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -591,6 +591,8 @@ struct polled_queue {
>      struct dp_netdev_rxq *rxq;
>      odp_port_t port_no;
>      bool emc_enabled;
> +    bool enabled;
> +    uint64_t change_seq;
>  };
>  
>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> @@ -1171,6 +1173,9 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
>              } else {
>                  ds_put_format(reply, "%s", "NOT AVAIL");
>              }
> +            ds_put_format(reply, "  polling: %s",
> +                          (netdev_rxq_enabled(list[i].rxq->rx))
> +                          ? "enabled" : "disabled");
>              ds_put_cstr(reply, "\n");
>          }
>          ovs_mutex_unlock(&pmd->port_mutex);
> @@ -5198,6 +5203,11 @@ dpif_netdev_run(struct dpif *dpif)
>                  }
>  
>                  for (i = 0; i < port->n_rxq; i++) {
> +
> +                    if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
> +                        continue;
> +                    }
> +
>                      if (dp_netdev_process_rxq_port(non_pmd,
>                                                     &port->rxqs[i],
>                                                     port->port_no)) {
> @@ -5371,6 +5381,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>          poll_list[i].rxq = poll->rxq;
>          poll_list[i].port_no = poll->rxq->port->port_no;
>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq->rx);
> +        poll_list[i].change_seq =
> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>          i++;
>      }
>  
> @@ -5436,6 +5449,10 @@ reload:
>  
>          for (i = 0; i < poll_cnt; i++) {
>  
> +            if (!poll_list[i].enabled) {
> +                continue;
> +            }
> +
>              if (poll_list[i].emc_enabled) {
>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>                                      &pmd->ctx.emc_insert_min);
> @@ -5472,6 +5489,16 @@ reload:
>              if (reload) {
>                  break;
>              }
> +
> +            for (i = 0; i < poll_cnt; i++) {
> +                uint64_t current_seq =
> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
> +                if (poll_list[i].change_seq != current_seq) {
> +                    poll_list[i].change_seq = current_seq;
> +                    poll_list[i].enabled =
> +                                 netdev_rxq_enabled(poll_list[i].rxq->rx);
> +                }
> +            }
>          }
>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>                                 pmd_perf_metrics_enabled(pmd));
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 47153dc..9088ff4 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -317,6 +317,10 @@ struct dpdk_mp {
>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>   };
>  
> +struct dpdk_rx_queue {
> +    bool enabled;
> +};

I'm wondering if we need to mimic tx_q here.
rxq_enable is vhost specific. We could use a plain array named
'vhost_rxq_enabled' and drop the structure. Freeing this array should
happen in '*vhost_destruct', not in 'common_destruct'. Clearing could
be done with 'memset' in this case.

What do you think?

> +
>  /* There should be one 'struct dpdk_tx_queue' created for
>   * each cpu core. */
>  struct dpdk_tx_queue {
> @@ -424,6 +428,8 @@ struct netdev_dpdk {
>          OVSRCU_TYPE(struct ingress_policer *) ingress_policer;
>          uint32_t policer_rate;
>          uint32_t policer_burst;
> +
> +        struct dpdk_rx_queue *rx_q;

Comment would be nice here.

>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -1109,6 +1115,12 @@ netdev_dpdk_alloc(void)
>      return NULL;
>  }
>  
> +static struct dpdk_rx_queue *
> +netdev_dpdk_alloc_rxq(unsigned int n_rxqs)
> +{
> +    return dpdk_rte_mzalloc(n_rxqs * sizeof(struct dpdk_rx_queue));
> +}
> +
>  static struct dpdk_tx_queue *
>  netdev_dpdk_alloc_txq(unsigned int n_txqs)
>  {
> @@ -1235,6 +1247,10 @@ vhost_common_construct(struct netdev *netdev)
>      int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>  
> +    dev->rx_q = netdev_dpdk_alloc_rxq(OVS_VHOST_MAX_QUEUE_NUM);

Do we need a function for this? Allocating in place, you'd be able to
use the more OVS-style sizeof call:

       dev->rx_q = dpdk_rte_mzalloc(OVS_VHOST_MAX_QUEUE_NUM * sizeof *dev->rx_q);

> +    if (!dev->rx_q) {
> +        return ENOMEM;
> +    }
>      dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
>      if (!dev->tx_q) {

Need to free 'rx_q' before return.

>          return ENOMEM;
> @@ -1354,6 +1370,7 @@ common_destruct(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dpdk_mutex)
>      OVS_EXCLUDED(dev->mutex)
>  {
> +    rte_free(dev->rx_q);
>      rte_free(dev->tx_q);
>      dpdk_mp_put(dev->dpdk_mp);
>  
> @@ -2201,6 +2218,14 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>  }
>  
>  static int
> +netdev_dpdk_vhost_rxq_enabled(struct netdev_rxq *rxq)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
> +
> +    return dev->rx_q[rxq->queue_id].enabled;
> +}
> +
> +static int
>  netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
>                       int *qfill)
>  {
> @@ -3531,6 +3556,17 @@ new_device(int vid)
>  
>  /* Clears mapping for all available queues of vhost interface. */
>  static void
> +netdev_dpdk_rxq_map_clear(struct netdev_dpdk *dev)

This isn't a 'map' actually.

> +    OVS_REQUIRES(dev->mutex)
> +{
> +    int i;
> +
> +    for (i = 0; i < dev->up.n_rxq; i++) {
> +        dev->rx_q[i].enabled = false;
> +    }
> +}
> +
> +static void
>  netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dev->mutex)
>  {
> @@ -3563,6 +3599,7 @@ destroy_device(int vid)
>              ovs_mutex_lock(&dev->mutex);
>              dev->vhost_reconfigured = false;
>              ovsrcu_index_set(&dev->vid, -1);
> +            netdev_dpdk_rxq_map_clear(dev);
>              netdev_dpdk_txq_map_clear(dev);
>  
>              netdev_change_seq_changed(&dev->up);
> @@ -3597,24 +3634,30 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
>      struct netdev_dpdk *dev;
>      bool exists = false;
>      int qid = queue_id / VIRTIO_QNUM;
> +    bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
>      char ifname[IF_NAME_SZ];
>  
>      rte_vhost_get_ifname(vid, ifname, sizeof ifname);
>  
> -    if (queue_id % VIRTIO_QNUM == VIRTIO_TXQ) {
> -        return 0;
> -    }
> -
>      ovs_mutex_lock(&dpdk_mutex);
>      LIST_FOR_EACH (dev, list_node, &dpdk_list) {
>          ovs_mutex_lock(&dev->mutex);
>          if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> -            if (enable) {
> -                dev->tx_q[qid].map = qid;
> +            if (is_rx) {
> +                bool enabled = dev->rx_q[qid].enabled;
> +
> +                dev->rx_q[qid].enabled = enable != 0;
> +                if (enabled ^ dev->rx_q[qid].enabled) {

Bit operations on booleans are not good practice. It's better to just use '!='.
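The point can be shown with a small stand-alone helper (the function name is illustrative, not from the patch): on C99 bools both `^` and `!=` happen to compute the same thing, but `!=` states the intent directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Detect a state change on a boolean: '!=' expresses the comparison
 * directly, whereas '^' relies on the bit representation of _Bool. */
static bool
rx_state_changed(bool old_state, int enable)
{
    bool new_state = enable != 0;

    return old_state != new_state;   /* rather than old_state ^ new_state */
}
```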

> +                    netdev_change_seq_changed(&dev->up);
> +                }
>              } else {
> -                dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> +                if (enable) {
> +                    dev->tx_q[qid].map = qid;
> +                } else {
> +                    dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> +                }
> +                netdev_dpdk_remap_txqs(dev);
>              }
> -            netdev_dpdk_remap_txqs(dev);
>              exists = true;
>              ovs_mutex_unlock(&dev->mutex);
>              break;
> @@ -3624,9 +3667,9 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
>      ovs_mutex_unlock(&dpdk_mutex);
>  
>      if (exists) {
> -        VLOG_INFO("State of queue %d ( tx_qid %d ) of vhost device '%s' "
> -                  "changed to \'%s\'", queue_id, qid, ifname,
> -                  (enable == 1) ? "enabled" : "disabled");
> +        VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s' "
> +                  "changed to \'%s\'", queue_id, is_rx == true ? "rx" : "tx",
> +                  qid, ifname, (enable == 1) ? "enabled" : "disabled");
>      } else {
>          VLOG_INFO("vHost Device '%s' not found", ifname);
>          return -1;
> @@ -4297,7 +4340,8 @@ static const struct netdev_class dpdk_vhost_class = {
>      .get_stats = netdev_dpdk_vhost_get_stats,
>      .get_status = netdev_dpdk_vhost_user_get_status,
>      .reconfigure = netdev_dpdk_vhost_reconfigure,
> -    .rxq_recv = netdev_dpdk_vhost_rxq_recv
> +    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> +    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
>  };
>  
>  static const struct netdev_class dpdk_vhost_client_class = {
> @@ -4311,7 +4355,8 @@ static const struct netdev_class dpdk_vhost_client_class = {
>      .get_stats = netdev_dpdk_vhost_get_stats,
>      .get_status = netdev_dpdk_vhost_user_get_status,
>      .reconfigure = netdev_dpdk_vhost_client_reconfigure,
> -    .rxq_recv = netdev_dpdk_vhost_rxq_recv
> +    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> +    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
>  };
>  
>  void
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index fb0c27e..39354db 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -789,6 +789,11 @@ struct netdev_class {
>      void (*rxq_destruct)(struct netdev_rxq *);
>      void (*rxq_dealloc)(struct netdev_rxq *);
>  
> +    /* A netdev can report if a queue won't get traffic and should be excluded
> +     * from polling (no callback implicitly means that the queue is enabled).
> +     */
> +    int (*rxq_enabled)(struct netdev_rxq *);
> +
>      /* Attempts to receive a batch of packets from 'rx'.  In 'batch', the
>       * caller supplies 'packets' as the pointer to the beginning of an array
>       * of NETDEV_MAX_BURST pointers to dp_packet.  If successful, the
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 7d7ecf6..88db6c7 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -682,6 +682,16 @@ netdev_rxq_close(struct netdev_rxq *rx)
>      }
>  }
>  
> +int netdev_rxq_enabled(struct netdev_rxq *rx)
> +{
> +    bool enabled = true;
> +
> +    if (rx->netdev->netdev_class->rxq_enabled) {
> +        enabled = rx->netdev->netdev_class->rxq_enabled(rx);
> +    }
> +    return enabled;

You're mixing and converting 'int' and 'bool' all the time.
I think it's better to change the function prototype to return 'bool'.

> +}
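A hedged sketch of the bool-returning wrapper being suggested (the struct and callback below are simplified stand-ins for the real netdev_rxq/netdev_class types): a queue is enabled by default when the provider implements no hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a netdev rx queue with an optional provider
 * callback; not the actual OVS types. */
struct fake_rxq {
    bool (*rxq_enabled)(const struct fake_rxq *);  /* NULL if not provided */
    bool enabled;
};

static bool
fake_vhost_rxq_enabled(const struct fake_rxq *rx)
{
    return rx->enabled;
}

static bool
fake_netdev_rxq_enabled(const struct fake_rxq *rx)
{
    /* No callback implicitly means the queue is enabled. */
    return rx->rxq_enabled ? rx->rxq_enabled(rx) : true;
}
```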
> +
>  /* Attempts to receive a batch of packets from 'rx'.  'batch' should point to
>   * the beginning of an array of NETDEV_MAX_BURST pointers to dp_packet.  If
>   * successful, this function stores pointers to up to NETDEV_MAX_BURST
> diff --git a/lib/netdev.h b/lib/netdev.h
> index d94817f..859f5ef 100644
> --- a/lib/netdev.h
> +++ b/lib/netdev.h
> @@ -183,6 +183,7 @@ enum netdev_pt_mode netdev_get_pt_mode(const struct netdev *);
>  /* Packet reception. */
>  int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
>  void netdev_rxq_close(struct netdev_rxq *);
> +int netdev_rxq_enabled(struct netdev_rxq *);
>  
>  const char *netdev_rxq_get_name(const struct netdev_rxq *);
>  int netdev_rxq_get_queue_id(const struct netdev_rxq *);
>
David Marchand April 13, 2019, 6:36 p.m. UTC | #4
On Fri, Apr 12, 2019 at 12:00 PM Ilya Maximets <i.maximets@samsung.com>
wrote:

> On 11.04.2019 17:13, David Marchand wrote:
> >
> > On Thu, Apr 11, 2019 at 4:02 PM David Marchand
> > <david.marchand@redhat.com> wrote:
> >
> > [...]
> >
> > This change broke the tests, since they match the exact command output.
> > Is the output format something we must maintain?
>
> I don't think so.
> You just need to fix the tests to match the new output format.
> However, in general, the output is already overloaded. Maybe it's
> worth printing the polling state only if it's disabled?
> This should also save you from changing most of the tests.
>

Well, either updating the test or doing this is fine for me.
I will go with your suggestion.
David Marchand April 13, 2019, 6:49 p.m. UTC | #5
On Fri, Apr 12, 2019 at 2:12 PM Ilya Maximets <i.maximets@samsung.com>
wrote:

> On 11.04.2019 17:00, David Marchand wrote:
> > We currently poll all available queues based on the max queue count
> > exchanged with the vhost peer and rely on the vhost library in DPDK to
> > check the vring status beneath.
> > [...]
> >
> > Signed-off-by: David Marchand <david.marchand@redhat.com>
> > ---
>
> Hi.
> One important issue:
> You need to enable queue #0 in case there were no state changes.
> We're doing this for Tx queues in 'dpdk_vhost_reconfigure_helper'.
> This is because, if F_PROTOCOL_FEATURES is not negotiated, vrings are
> created in the enabled state. This is required to support older QEMU
> or legacy drivers.
>

Ah ok, that explains this txq0 special treatment.
Thanks.


> I'm working on a DPDK patch to trigger 'vring_state_changed' in this
> case; however, it could take a long time or never be accepted into LTS.
> I'd also prefer not to check for negotiated features in OVS code.
> I think this kind of stuff should be hidden from the application by
> the vhost library.
>

Ok, we will have to keep this workaround anyway.



>
> More comments inline.
>
> Also, you might want to add Ian to the CC list when sending userspace
> datapath related patches.
>

I dropped the previous Cc: list by mistake.
Sure will keep him Cc.



> > [...]
> >
> > @@ -317,6 +317,10 @@ struct dpdk_mp {
> >       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> >   };
> >
> > +struct dpdk_rx_queue {
> > +    bool enabled;
> > +};
>
> I'm wondering if we need to mimic tx_q here.
> rxq_enable is vhost specific. We could use a plain array named
> 'vhost_rxq_enabled' and drop the structure. Freeing this array should
> happen in '*vhost_destruct', not in 'common_destruct'. Clearing could
> be done with 'memset' in this case.
>
> What do you think?
>

I had something similar in my first test, dropped it thinking I would align
on tx_q.
But indeed, tx_q is generic, while rx_q is really vhost specific.
We have no reason for a struct at the moment, so let's go with a simple
array.



> > +
> >  /* There should be one 'struct dpdk_tx_queue' created for
> >   * each cpu core. */
> >  struct dpdk_tx_queue {
> > @@ -424,6 +428,8 @@ struct netdev_dpdk {
> >          OVSRCU_TYPE(struct ingress_policer *) ingress_policer;
> >          uint32_t policer_rate;
> >          uint32_t policer_burst;
> > +
> > +        struct dpdk_rx_queue *rx_q;
>
> Comment would be nice here.
>

Sure.



> >      );
> >
> >      PADDED_MEMBERS(CACHE_LINE_SIZE,
> > @@ -1109,6 +1115,12 @@ netdev_dpdk_alloc(void)
> >      return NULL;
> >  }
> >
> > +static struct dpdk_rx_queue *
> > +netdev_dpdk_alloc_rxq(unsigned int n_rxqs)
> > +{
> > +    return dpdk_rte_mzalloc(n_rxqs * sizeof(struct dpdk_rx_queue));
> > +}
> > +
> >  static struct dpdk_tx_queue *
> >  netdev_dpdk_alloc_txq(unsigned int n_txqs)
> >  {
> > @@ -1235,6 +1247,10 @@ vhost_common_construct(struct netdev *netdev)
> >      int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >
> > +    dev->rx_q = netdev_dpdk_alloc_rxq(OVS_VHOST_MAX_QUEUE_NUM);
>
> Do we need a function for this? Allocating in place, you'd be able to
> use the more OVS-style sizeof call:
>
>        dev->rx_q = dpdk_rte_mzalloc(OVS_VHOST_MAX_QUEUE_NUM * sizeof *dev->rx_q);
>

Ok.


> > +    if (!dev->rx_q) {
> > +        return ENOMEM;
> > +    }
> >      dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
> >      if (!dev->tx_q) {
>
> Need to free 'rx_q' before return.
>

Indeed.


> >          return ENOMEM;
> > @@ -1354,6 +1370,7 @@ common_destruct(struct netdev_dpdk *dev)
> >      OVS_REQUIRES(dpdk_mutex)
> >      OVS_EXCLUDED(dev->mutex)
> >  {
> > +    rte_free(dev->rx_q);
> >      rte_free(dev->tx_q);
> >      dpdk_mp_put(dev->dpdk_mp);
> >
> > @@ -2201,6 +2218,14 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> >  }
> >
> >  static int
> > +netdev_dpdk_vhost_rxq_enabled(struct netdev_rxq *rxq)
> > +{
> > +    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
> > +
> > +    return dev->rx_q[rxq->queue_id].enabled;
> > +}
> > +
> > +static int
> >  netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch
> *batch,
> >                       int *qfill)
> >  {
> > @@ -3531,6 +3556,17 @@ new_device(int vid)
> >
> >  /* Clears mapping for all available queues of vhost interface. */
> >  static void
> > +netdev_dpdk_rxq_map_clear(struct netdev_dpdk *dev)
>
> This isn't a 'map' actually.
>

I'll go with a simple memset as you suggested.
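i.e. something along these lines (sketch; 'vhost_rxq_enabled' as the flat array name from the earlier suggestion):

```c
#include <stdbool.h>
#include <string.h>

/* destroy_device() path, sketched: with a flat boolean array there is
 * no need for a per-queue loop -- a single memset over the active rxqs
 * marks them all disabled at once. */
static void
rxq_enabled_clear_sketch(bool *vhost_rxq_enabled, size_t n_rxq)
{
    memset(vhost_rxq_enabled, 0, n_rxq * sizeof vhost_rxq_enabled[0]);
}
```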


> > +    OVS_REQUIRES(dev->mutex)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < dev->up.n_rxq; i++) {
> > +        dev->rx_q[i].enabled = false;
> > +    }
> > +}
> > +
> > +static void
> >  netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
> >      OVS_REQUIRES(dev->mutex)
> >  {
> > @@ -3563,6 +3599,7 @@ destroy_device(int vid)
> >              ovs_mutex_lock(&dev->mutex);
> >              dev->vhost_reconfigured = false;
> >              ovsrcu_index_set(&dev->vid, -1);
> > +            netdev_dpdk_rxq_map_clear(dev);
> >              netdev_dpdk_txq_map_clear(dev);
> >
> >              netdev_change_seq_changed(&dev->up);
> > @@ -3597,24 +3634,30 @@ vring_state_changed(int vid, uint16_t queue_id,
> int enable)
> >      struct netdev_dpdk *dev;
> >      bool exists = false;
> >      int qid = queue_id / VIRTIO_QNUM;
> > +    bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
> >      char ifname[IF_NAME_SZ];
> >
> >      rte_vhost_get_ifname(vid, ifname, sizeof ifname);
> >
> > -    if (queue_id % VIRTIO_QNUM == VIRTIO_TXQ) {
> > -        return 0;
> > -    }
> > -
> >      ovs_mutex_lock(&dpdk_mutex);
> >      LIST_FOR_EACH (dev, list_node, &dpdk_list) {
> >          ovs_mutex_lock(&dev->mutex);
> >          if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> > -            if (enable) {
> > -                dev->tx_q[qid].map = qid;
> > +            if (is_rx) {
> > +                bool enabled = dev->rx_q[qid].enabled;
> > +
> > +                dev->rx_q[qid].enabled = enable != 0;
> > +                if (enabled ^ dev->rx_q[qid].enabled) {
>
> Bit ops on a boolean seems not a good practice. It's better to just use
> '!='.
>

Ok.
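The rx branch of vring_state_changed() then reads along these lines (sketch; a plain counter stands in for netdev_change_seq_changed()):

```c
#include <stdbool.h>

/* Sketch of the rx branch: record the new state and bump the netdev
 * change sequence only on an actual transition, using '!=' rather
 * than '^' on booleans. */
static void
rx_state_change_sketch(bool *rxq_enabled, int qid, int enable,
                       unsigned int *change_seq)
{
    bool was_enabled = rxq_enabled[qid];

    rxq_enabled[qid] = enable != 0;
    if (was_enabled != rxq_enabled[qid]) {
        (*change_seq)++;   /* stands in for netdev_change_seq_changed() */
    }
}
```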


> > +                    netdev_change_seq_changed(&dev->up);
> > +                }
> >              } else {
> > -                dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> > +                if (enable) {
> > +                    dev->tx_q[qid].map = qid;
> > +                } else {
> > +                    dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> > +                }
> > +                netdev_dpdk_remap_txqs(dev);
> >              }
> > -            netdev_dpdk_remap_txqs(dev);
> >              exists = true;
> >              ovs_mutex_unlock(&dev->mutex);
> >              break;
> > @@ -3624,9 +3667,9 @@ vring_state_changed(int vid, uint16_t queue_id,
> int enable)
> >      ovs_mutex_unlock(&dpdk_mutex);
> >
> >      if (exists) {
> > -        VLOG_INFO("State of queue %d ( tx_qid %d ) of vhost device '%s'
> "
> > -                  "changed to \'%s\'", queue_id, qid, ifname,
> > -                  (enable == 1) ? "enabled" : "disabled");
> > +        VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s'
> "
> > +                  "changed to \'%s\'", queue_id, is_rx == true ? "rx" :
> "tx",
> > +                  qid, ifname, (enable == 1) ? "enabled" : "disabled");
> >      } else {
> >          VLOG_INFO("vHost Device '%s' not found", ifname);
> >          return -1;
> > @@ -4297,7 +4340,8 @@ static const struct netdev_class dpdk_vhost_class
> = {
> >      .get_stats = netdev_dpdk_vhost_get_stats,
> >      .get_status = netdev_dpdk_vhost_user_get_status,
> >      .reconfigure = netdev_dpdk_vhost_reconfigure,
> > -    .rxq_recv = netdev_dpdk_vhost_rxq_recv
> > +    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> > +    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
> >  };
> >
> >  static const struct netdev_class dpdk_vhost_client_class = {
> > @@ -4311,7 +4355,8 @@ static const struct netdev_class
> dpdk_vhost_client_class = {
> >      .get_stats = netdev_dpdk_vhost_get_stats,
> >      .get_status = netdev_dpdk_vhost_user_get_status,
> >      .reconfigure = netdev_dpdk_vhost_client_reconfigure,
> > -    .rxq_recv = netdev_dpdk_vhost_rxq_recv
> > +    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> > +    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
> >  };
> >
> >  void
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index fb0c27e..39354db 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -789,6 +789,11 @@ struct netdev_class {
> >      void (*rxq_destruct)(struct netdev_rxq *);
> >      void (*rxq_dealloc)(struct netdev_rxq *);
> >
> > +    /* A netdev can report if a queue won't get traffic and should be
> excluded
> > +     * from polling (no callback implicitly means that the queue is
> enabled).
> > +     */
> > +    int (*rxq_enabled)(struct netdev_rxq *);
> > +
> >      /* Attempts to receive a batch of packets from 'rx'.  In 'batch',
> the
> >       * caller supplies 'packets' as the pointer to the beginning of an
> array
> >       * of NETDEV_MAX_BURST pointers to dp_packet.  If successful, the
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 7d7ecf6..88db6c7 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -682,6 +682,16 @@ netdev_rxq_close(struct netdev_rxq *rx)
> >      }
> >  }
> >
> > +int netdev_rxq_enabled(struct netdev_rxq *rx)
> > +{
> > +    bool enabled = true;
> > +
> > +    if (rx->netdev->netdev_class->rxq_enabled) {
> > +        enabled = rx->netdev->netdev_class->rxq_enabled(rx);
> > +    }
> > +    return enabled;
>
> You're mixing and converting 'int' and 'bool' all the time.
> I think, it's better to change the function prototype to return 'bool'.
>

Yes.
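For v2 the wrapper then becomes something like this (sketch with trimmed-down stand-ins for the netdev structures, which carry many more fields in reality):

```c
#include <stdbool.h>
#include <stddef.h>

/* Trimmed-down stand-ins, just enough to show the wrapper shape. */
struct netdev_rxq;

struct netdev_class_sk {
    bool (*rxq_enabled)(struct netdev_rxq *);
};

struct netdev_sk {
    const struct netdev_class_sk *netdev_class;
};

struct netdev_rxq {
    struct netdev_sk *netdev;
};

/* A missing callback implicitly means the queue is enabled; otherwise
 * defer to the provider, keeping 'bool' end to end. */
static bool
netdev_rxq_enabled_sketch(struct netdev_rxq *rx)
{
    if (rx->netdev->netdev_class->rxq_enabled) {
        return rx->netdev->netdev_class->rxq_enabled(rx);
    }
    return true;
}

/* Example provider callback for the sketch. */
static bool
rxq_always_disabled(struct netdev_rxq *rx)
{
    (void) rx;
    return false;
}
```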



> > +}
> > +
> >  /* Attempts to receive a batch of packets from 'rx'.  'batch' should
> point to
> >   * the beginning of an array of NETDEV_MAX_BURST pointers to
> dp_packet.  If
> >   * successful, this function stores pointers to up to NETDEV_MAX_BURST
> > diff --git a/lib/netdev.h b/lib/netdev.h
> > index d94817f..859f5ef 100644
> > --- a/lib/netdev.h
> > +++ b/lib/netdev.h
> > @@ -183,6 +183,7 @@ enum netdev_pt_mode netdev_get_pt_mode(const struct
> netdev *);
> >  /* Packet reception. */
> >  int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
> >  void netdev_rxq_close(struct netdev_rxq *);
> > +int netdev_rxq_enabled(struct netdev_rxq *);
> >
> >  const char *netdev_rxq_get_name(const struct netdev_rxq *);
> >  int netdev_rxq_get_queue_id(const struct netdev_rxq *);
> >
>


Thanks a lot for the review, I will send a v2 on Monday if all goes well.
Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4d6d0c3..c95612f 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -591,6 +591,8 @@  struct polled_queue {
     struct dp_netdev_rxq *rxq;
     odp_port_t port_no;
     bool emc_enabled;
+    bool enabled;
+    uint64_t change_seq;
 };
 
 /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
@@ -1171,6 +1173,9 @@  pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
             } else {
                 ds_put_format(reply, "%s", "NOT AVAIL");
             }
+            ds_put_format(reply, "  polling: %s",
+                          (netdev_rxq_enabled(list[i].rxq->rx))
+                          ? "enabled" : "disabled");
             ds_put_cstr(reply, "\n");
         }
         ovs_mutex_unlock(&pmd->port_mutex);
@@ -5198,6 +5203,11 @@  dpif_netdev_run(struct dpif *dpif)
                 }
 
                 for (i = 0; i < port->n_rxq; i++) {
+
+                    if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
+                        continue;
+                    }
+
                     if (dp_netdev_process_rxq_port(non_pmd,
                                                    &port->rxqs[i],
                                                    port->port_no)) {
@@ -5371,6 +5381,9 @@  pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
         poll_list[i].rxq = poll->rxq;
         poll_list[i].port_no = poll->rxq->port->port_no;
         poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
+        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq->rx);
+        poll_list[i].change_seq =
+                     netdev_get_change_seq(poll->rxq->port->netdev);
         i++;
     }
 
@@ -5436,6 +5449,10 @@  reload:
 
         for (i = 0; i < poll_cnt; i++) {
 
+            if (!poll_list[i].enabled) {
+                continue;
+            }
+
             if (poll_list[i].emc_enabled) {
                 atomic_read_relaxed(&pmd->dp->emc_insert_min,
                                     &pmd->ctx.emc_insert_min);
@@ -5472,6 +5489,16 @@  reload:
             if (reload) {
                 break;
             }
+
+            for (i = 0; i < poll_cnt; i++) {
+                uint64_t current_seq =
+                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
+                if (poll_list[i].change_seq != current_seq) {
+                    poll_list[i].change_seq = current_seq;
+                    poll_list[i].enabled =
+                                 netdev_rxq_enabled(poll_list[i].rxq->rx);
+                }
+            }
         }
         pmd_perf_end_iteration(s, rx_packets, tx_packets,
                                pmd_perf_metrics_enabled(pmd));
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 47153dc..9088ff4 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -317,6 +317,10 @@  struct dpdk_mp {
      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
  };
 
+struct dpdk_rx_queue {
+    bool enabled;
+};
+
 /* There should be one 'struct dpdk_tx_queue' created for
  * each cpu core. */
 struct dpdk_tx_queue {
@@ -424,6 +428,8 @@  struct netdev_dpdk {
         OVSRCU_TYPE(struct ingress_policer *) ingress_policer;
         uint32_t policer_rate;
         uint32_t policer_burst;
+
+        struct dpdk_rx_queue *rx_q;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -1109,6 +1115,12 @@  netdev_dpdk_alloc(void)
     return NULL;
 }
 
+static struct dpdk_rx_queue *
+netdev_dpdk_alloc_rxq(unsigned int n_rxqs)
+{
+    return dpdk_rte_mzalloc(n_rxqs * sizeof(struct dpdk_rx_queue));
+}
+
 static struct dpdk_tx_queue *
 netdev_dpdk_alloc_txq(unsigned int n_txqs)
 {
@@ -1235,6 +1247,10 @@  vhost_common_construct(struct netdev *netdev)
     int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
+    dev->rx_q = netdev_dpdk_alloc_rxq(OVS_VHOST_MAX_QUEUE_NUM);
+    if (!dev->rx_q) {
+        return ENOMEM;
+    }
     dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
     if (!dev->tx_q) {
         return ENOMEM;
@@ -1354,6 +1370,7 @@  common_destruct(struct netdev_dpdk *dev)
     OVS_REQUIRES(dpdk_mutex)
     OVS_EXCLUDED(dev->mutex)
 {
+    rte_free(dev->rx_q);
     rte_free(dev->tx_q);
     dpdk_mp_put(dev->dpdk_mp);
 
@@ -2201,6 +2218,14 @@  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 }
 
 static int
+netdev_dpdk_vhost_rxq_enabled(struct netdev_rxq *rxq)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
+
+    return dev->rx_q[rxq->queue_id].enabled;
+}
+
+static int
 netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
                      int *qfill)
 {
@@ -3531,6 +3556,17 @@  new_device(int vid)
 
 /* Clears mapping for all available queues of vhost interface. */
 static void
+netdev_dpdk_rxq_map_clear(struct netdev_dpdk *dev)
+    OVS_REQUIRES(dev->mutex)
+{
+    int i;
+
+    for (i = 0; i < dev->up.n_rxq; i++) {
+        dev->rx_q[i].enabled = false;
+    }
+}
+
+static void
 netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
     OVS_REQUIRES(dev->mutex)
 {
@@ -3563,6 +3599,7 @@  destroy_device(int vid)
             ovs_mutex_lock(&dev->mutex);
             dev->vhost_reconfigured = false;
             ovsrcu_index_set(&dev->vid, -1);
+            netdev_dpdk_rxq_map_clear(dev);
             netdev_dpdk_txq_map_clear(dev);
 
             netdev_change_seq_changed(&dev->up);
@@ -3597,24 +3634,30 @@  vring_state_changed(int vid, uint16_t queue_id, int enable)
     struct netdev_dpdk *dev;
     bool exists = false;
     int qid = queue_id / VIRTIO_QNUM;
+    bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
     char ifname[IF_NAME_SZ];
 
     rte_vhost_get_ifname(vid, ifname, sizeof ifname);
 
-    if (queue_id % VIRTIO_QNUM == VIRTIO_TXQ) {
-        return 0;
-    }
-
     ovs_mutex_lock(&dpdk_mutex);
     LIST_FOR_EACH (dev, list_node, &dpdk_list) {
         ovs_mutex_lock(&dev->mutex);
         if (nullable_string_is_equal(ifname, dev->vhost_id)) {
-            if (enable) {
-                dev->tx_q[qid].map = qid;
+            if (is_rx) {
+                bool enabled = dev->rx_q[qid].enabled;
+
+                dev->rx_q[qid].enabled = enable != 0;
+                if (enabled ^ dev->rx_q[qid].enabled) {
+                    netdev_change_seq_changed(&dev->up);
+                }
             } else {
-                dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
+                if (enable) {
+                    dev->tx_q[qid].map = qid;
+                } else {
+                    dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
+                }
+                netdev_dpdk_remap_txqs(dev);
             }
-            netdev_dpdk_remap_txqs(dev);
             exists = true;
             ovs_mutex_unlock(&dev->mutex);
             break;
@@ -3624,9 +3667,9 @@  vring_state_changed(int vid, uint16_t queue_id, int enable)
     ovs_mutex_unlock(&dpdk_mutex);
 
     if (exists) {
-        VLOG_INFO("State of queue %d ( tx_qid %d ) of vhost device '%s' "
-                  "changed to \'%s\'", queue_id, qid, ifname,
-                  (enable == 1) ? "enabled" : "disabled");
+        VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s' "
+                  "changed to \'%s\'", queue_id, is_rx == true ? "rx" : "tx",
+                  qid, ifname, (enable == 1) ? "enabled" : "disabled");
     } else {
         VLOG_INFO("vHost Device '%s' not found", ifname);
         return -1;
@@ -4297,7 +4340,8 @@  static const struct netdev_class dpdk_vhost_class = {
     .get_stats = netdev_dpdk_vhost_get_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_reconfigure,
-    .rxq_recv = netdev_dpdk_vhost_rxq_recv
+    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
+    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
 };
 
 static const struct netdev_class dpdk_vhost_client_class = {
@@ -4311,7 +4355,8 @@  static const struct netdev_class dpdk_vhost_client_class = {
     .get_stats = netdev_dpdk_vhost_get_stats,
     .get_status = netdev_dpdk_vhost_user_get_status,
     .reconfigure = netdev_dpdk_vhost_client_reconfigure,
-    .rxq_recv = netdev_dpdk_vhost_rxq_recv
+    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
+    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
 };
 
 void
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index fb0c27e..39354db 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -789,6 +789,11 @@  struct netdev_class {
     void (*rxq_destruct)(struct netdev_rxq *);
     void (*rxq_dealloc)(struct netdev_rxq *);
 
+    /* A netdev can report if a queue won't get traffic and should be excluded
+     * from polling (no callback implicitly means that the queue is enabled).
+     */
+    int (*rxq_enabled)(struct netdev_rxq *);
+
     /* Attempts to receive a batch of packets from 'rx'.  In 'batch', the
      * caller supplies 'packets' as the pointer to the beginning of an array
      * of NETDEV_MAX_BURST pointers to dp_packet.  If successful, the
diff --git a/lib/netdev.c b/lib/netdev.c
index 7d7ecf6..88db6c7 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -682,6 +682,16 @@  netdev_rxq_close(struct netdev_rxq *rx)
     }
 }
 
+int netdev_rxq_enabled(struct netdev_rxq *rx)
+{
+    bool enabled = true;
+
+    if (rx->netdev->netdev_class->rxq_enabled) {
+        enabled = rx->netdev->netdev_class->rxq_enabled(rx);
+    }
+    return enabled;
+}
+
 /* Attempts to receive a batch of packets from 'rx'.  'batch' should point to
  * the beginning of an array of NETDEV_MAX_BURST pointers to dp_packet.  If
  * successful, this function stores pointers to up to NETDEV_MAX_BURST
diff --git a/lib/netdev.h b/lib/netdev.h
index d94817f..859f5ef 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -183,6 +183,7 @@  enum netdev_pt_mode netdev_get_pt_mode(const struct netdev *);
 /* Packet reception. */
 int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
 void netdev_rxq_close(struct netdev_rxq *);
+int netdev_rxq_enabled(struct netdev_rxq *);
 
 const char *netdev_rxq_get_name(const struct netdev_rxq *);
 int netdev_rxq_get_queue_id(const struct netdev_rxq *);