diff mbox series

[ovs-dev,v8] dpif-netlink: distribute polling to discreet handlers

Message ID 20200909170821.40825-1-mark.d.gray@redhat.com
State Changes Requested
Headers show
Series [ovs-dev,v8] dpif-netlink: distribute polling to discreet handlers | expand

Commit Message

Mark Gray Sept. 9, 2020, 5:08 p.m. UTC
From: Aaron Conole <aconole@redhat.com>

Currently, the channel handlers are polled globally.  On some
systems, this causes a thundering herd issue where multiple
handler threads become active, only to do no work and immediately
sleep.

The approach here is to push the netlink socket channels to discreet
handler threads to process, rather than polling on every thread.
This will eliminate the need to wake multiple threads.

To check:

  ip netns add left
  ip netns add right
  ip link add center-left type veth peer name left0 netns left
  ip link add center-right type veth peer name right0 netns right
  ip link set center-left up
  ip link set center-right up
  ip -n left ip link set left0 up
  ip -n left ip addr add 172.31.110.10/24 dev left0
  ip -n right ip link set right0 up
  ip -n right ip addr add 172.31.110.11/24 dev right0

  ovs-vsctl add-br br0
  ovs-vsctl add-port br0 center-right
  ovs-vsctl add-port br0 center-left

  # in one terminal
  perf record -e sched:sched_wakeup,irq:softirq_entry -ag

  # in a separate terminal
  ip netns exec left arping -I left0 -c 1 172.31.110.11

  # in the perf terminal after exiting
  perf script

Look for the number of 'handler' threads which were made active.

Suggested-by: Ben Pfaff <blp@ovn.org>
Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
Reported-by: David Ahern <dsahern@gmail.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
Cc: Matteo Croce <technoboy85@gmail.com>
Cc: Flavio Leitner <fbl@sysclose.org>
Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
---


v2: Oops - forgot to commit my whitespace cleanups.
v3: one port latency results as per Matteo's comments

    Stock:
      min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
    With Patch:
      min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
v4: Oops - first email did not send
v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
v6: Split out the AUTHORS patch and added a cover letter as
    per Ilya's suggestion.
    Fixed 0-day issues.
v7: Merged patches as per Flavio's suggestion. This is
    no longer a series. Fixed some other small issues.

 lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
 lib/id-pool.c      |  10 ++
 lib/id-pool.h      |   1 +
 3 files changed, 197 insertions(+), 144 deletions(-)

Comments

Flavio Leitner Sept. 10, 2020, 1:23 p.m. UTC | #1
On Wed, Sep 09, 2020 at 06:08:21PM +0100, Mark Gray wrote:
> From: Aaron Conole <aconole@redhat.com>
> 
> Currently, the channel handlers are polled globally.  On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
> 
> The approach here is to push the netlink socket channels to discreet
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
> 
> To check:
> 
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0 netns left
>   ip link add center-right type veth peer name right0 netns right
>   ip link set center-left up
>   ip link set center-right up
>   ip -n left ip link set left0 up
>   ip -n left ip addr add 172.31.110.10/24 dev left0
>   ip -n right ip link set right0 up
>   ip -n right ip addr add 172.31.110.11/24 dev right0
> 
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 center-right
>   ovs-vsctl add-port br0 center-left
> 
>   # in one terminal
>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> 
>   # in a separate terminal
>   ip netns exec left arping -I left0 -c 1 172.31.110.11
> 
>   # in the perf terminal after exiting
>   perf script
> 
> Look for the number of 'handler' threads which were made active.
> 
> Suggested-by: Ben Pfaff <blp@ovn.org>
> Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
> ---

Acked-by: Flavio Leitner <fbl@sysclose.org>
Thanks Aaron and Mark!
fbl
Mark Gray Sept. 10, 2020, 1:34 p.m. UTC | #2
On 10/09/2020 14:23, Flavio Leitner wrote:
> On Wed, Sep 09, 2020 at 06:08:21PM +0100, Mark Gray wrote:

> Acked-by: Flavio Leitner <fbl@sysclose.org>
> Thanks Aaron and Mark!
> fbl
> 

Thanks for the review.
Ilya Maximets Sept. 17, 2020, 6:43 p.m. UTC | #3
On 9/9/20 7:08 PM, Mark Gray wrote:
> From: Aaron Conole <aconole@redhat.com>
> 
> Currently, the channel handlers are polled globally.  On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
> 
> The approach here is to push the netlink socket channels to discreet
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
> 
> To check:
> 
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0 netns left
>   ip link add center-right type veth peer name right0 netns right
>   ip link set center-left up
>   ip link set center-right up
>   ip -n left ip link set left0 up
>   ip -n left ip addr add 172.31.110.10/24 dev left0
>   ip -n right ip link set right0 up
>   ip -n right ip addr add 172.31.110.11/24 dev right0
> 
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 center-right
>   ovs-vsctl add-port br0 center-left
> 
>   # in one terminal
>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> 
>   # in a separate terminal
>   ip netns exec left arping -I left0 -c 1 172.31.110.11
> 
>   # in the perf terminal after exiting
>   perf script
> 
> Look for the number of 'handler' threads which were made active.
> 
> Suggested-by: Ben Pfaff <blp@ovn.org>
> Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
> ---
> 
> 
> v2: Oops - forgot to commit my whitespace cleanups.
> v3: one port latency results as per Matteo's comments
> 
>     Stock:
>       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
>     With Patch:
>       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
> v4: Oops - first email did not send
> v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
> v6: Split out the AUTHORS patch and added a cover letter as
>     per Ilya's suggestion.
>     Fixed 0-day issues.
> v7: Merged patches as per Flavio's suggestion. This is
>     no longer a series. Fixed some other small issues.
> 
>  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
>  lib/id-pool.c      |  10 ++
>  lib/id-pool.h      |   1 +
>  3 files changed, 197 insertions(+), 144 deletions(-)
> 
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 7da4fb54d..a74838506 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -37,6 +37,7 @@
>  #include "dpif-provider.h"
>  #include "fat-rwlock.h"
>  #include "flow.h"
> +#include "id-pool.h"
>  #include "netdev-linux.h"
>  #include "netdev-offload.h"
>  #include "netdev-provider.h"
> @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
>  
>  /* One of the dpif channels between the kernel and userspace. */
>  struct dpif_channel {
> +    struct hmap_node hmap_node;

This needs a comment and all other new structure fileds too.
Ex.:
    struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */

> +    struct dpif_handler *handler;

Why do we need this back-pointer?  Maybe it's better to just add 'handler'
argument to dpif_channel_del() ?

> +    uint32_t port_idx;

I see a lot of 'port_idx' here and there, but it's not really an index anymore.
You removed all the arrays that used these variables as index.
Can we just use 'port_no' directly here?  I know that there will be conversions
from odp to u32, but this should be more clear.  If conversions are the issue,
we could store u32 version of 'port_no' here.

And, please, add a comment.

>      struct nl_sock *sock;       /* Netlink socket. */
>      long long int last_poll;    /* Last time this channel was polled. */
>  };
> @@ -183,6 +187,8 @@ struct dpif_handler {
>      int n_events;                 /* Num events returned by epoll_wait(). */
>      int event_offset;             /* Offset into 'epoll_events'. */
>  
> +    struct hmap channels;         /* map of channels for each port. */

Comment style: Capital letter.

> +
>  #ifdef _WIN32
>      /* Pool of sockets. */
>      struct dpif_windows_vport_sock *vport_sock_pool;
> @@ -201,9 +207,8 @@ struct dpif_netlink {
>      struct fat_rwlock upcall_lock;
>      struct dpif_handler *handlers;
>      uint32_t n_handlers;           /* Num of upcall handlers. */
> -    struct dpif_channel *channels; /* Array of channels for each port. */
> -    int uc_array_size;             /* Size of 'handler->channels' and */
> -                                   /* 'handler->epoll_events'. */
> +
> +    struct id_pool *port_ids;      /* Set of added port ids */

What is the reason of having this pool?  Datapath port numbers are unique
except for ODPP_NONE, and all the functions you implemented actually
iterates over all channels anyway to find one with requested port_idx.
So, the only purpose of it is to avoid searching on incorect port number,
but this should be a rare case.  Can we just remove it?

OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
instead as it already has all required infrastructure and doesn't actually
care about range of values.

>  
>      /* Change notification. */
>      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
> @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
>  #endif
>  }
>  
> +static size_t
> +dpif_handler_channels_count(const struct dpif_handler *handler)
> +{
> +    return hmap_count(&handler->channels);
> +}
> +
> +static struct dpif_channel *
> +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
> +{
> +    struct dpif_channel *channel;
> +
> +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
> +                             &handler->channels) {
> +        if (channel->port_idx == idx) {
> +            return channel;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void
> +dpif_channel_del(struct dpif_netlink *dpif,
> +                 struct dpif_channel *channel)

As suggested above, it might make sense to just pass 'handler' here and
avoid having back-pointer inside the channel structure.

> +{
> +    struct dpif_handler *handler = channel->handler;
> +
> +#ifndef _WIN32
> +    epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> +              nl_sock_fd(channel->sock), NULL);
> +    nl_sock_destroy(channel->sock);
> +#endif
> +    hmap_remove(&handler->channels, &channel->hmap_node);
> +    id_pool_free_id(dpif->port_ids, channel->port_idx);
> +    handler->event_offset = handler->n_events = 0;
> +    free(channel);
> +}
> +
>  static struct dpif_netlink *
>  dpif_netlink_cast(const struct dpif *dpif)
>  {
> @@ -385,6 +428,7 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif **dpifp)
>  
>      dpif->dp_ifindex = dp->dp_ifindex;
>      dpif->user_features = dp->user_features;
> +    dpif->port_ids = id_pool_create(0, MAX_PORTS);
>      *dpifp = &dpif->dpif;
>  
>      return 0;
> @@ -446,167 +490,151 @@ error:
>  }
>  #endif /* _WIN32 */
>  
> -/* Given the port number 'port_idx', extracts the pid of netlink socket
> +/* Given the port number 'port_no', extracts the pid of netlink socket
>   * associated to the port and assigns it to 'upcall_pid'. */
>  static bool
> -vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
> +vport_get_pid(struct dpif_netlink *dpif, odp_port_t port_no,
>                uint32_t *upcall_pid)

This function seems redundant as it does almost same thing as
dpif_netlink_port_get_pid__().  Can we combine them?

>  {
> -    /* Since the nl_sock can only be assigned in either all
> -     * or none "dpif" channels, the following check
> -     * would suffice. */
> -    if (!dpif->channels[port_idx].sock) {
> +    uint32_t port_idx = odp_to_u32(port_no);
> +    size_t i;
> +
> +    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
> +
> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
>          return false;
>      }
> -    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>  
> -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
> +    /* The 'port_idx' should only be valid for a single handler. */
> +    for (i = 0; i < dpif->n_handlers; i++) {
> +        struct dpif_channel *channel;
> +
> +        channel = dpif_handler_channel_find(&dpif->handlers[i], port_idx);
> +        if (channel) {
> +            *upcall_pid = nl_sock_pid(channel->sock);
> +            return true;
> +        }
> +    }
>  
> -    return true;
> +    return false;
>  }
>  
>  static int
>  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
>                    struct nl_sock *sock)
>  {
> -    struct epoll_event event;
>      uint32_t port_idx = odp_to_u32(port_no);
> +    struct dpif_channel *channel;
> +    struct dpif_handler *handler;
> +    struct epoll_event event;
>      size_t i;
> -    int error;
>  
> -    if (dpif->handlers == NULL) {
> +    if (dpif->handlers == NULL ||
> +        id_pool_has_id(dpif->port_ids, port_idx)) {
>          close_nl_sock(sock);
> -        return 0;
> +        return EINVAL;
>      }
>  
> -    /* We assume that the datapath densely chooses port numbers, which can
> -     * therefore be used as an index into 'channels' and 'epoll_events' of
> -     * 'dpif'. */
> -    if (port_idx >= dpif->uc_array_size) {
> -        uint32_t new_size = port_idx + 1;
> -
> -        if (new_size > MAX_PORTS) {
> -            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
> -                         dpif_name(&dpif->dpif), port_no);
> -            return EFBIG;
> -        }
> +    if (port_idx >= MAX_PORTS) {
> +        VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
> +                     dpif_name(&dpif->dpif), port_no);

This looks a bit wierd since we're checking port_idx and printing port_no
instead.  Might be refined if we will just use port_no everywhere.

> +        return EFBIG;
> +    }
>  
> -        dpif->channels = xrealloc(dpif->channels,
> -                                  new_size * sizeof *dpif->channels);
> +    handler = &dpif->handlers[0];
>  
> -        for (i = dpif->uc_array_size; i < new_size; i++) {
> -            dpif->channels[i].sock = NULL;
> +    for (i = 0; i < dpif->n_handlers; i++) {
> +        if (dpif_handler_channels_count(&dpif->handlers[i]) <
> +            dpif_handler_channels_count(handler)) {
> +            handler = &dpif->handlers[i];
>          }
> +    }
>  
> -        for (i = 0; i < dpif->n_handlers; i++) {
> -            struct dpif_handler *handler = &dpif->handlers[i];
> +    channel = xmalloc(sizeof *channel);
> +    channel->port_idx = port_idx;
> +    channel->sock = sock;
> +    channel->last_poll = LLONG_MIN;
> +    channel->handler = handler;
> +    hmap_insert(&handler->channels, &channel->hmap_node,
> +                hash_int(port_idx, 0));
>  
> -            handler->epoll_events = xrealloc(handler->epoll_events,
> -                new_size * sizeof *handler->epoll_events);
> -
> -        }
> -        dpif->uc_array_size = new_size;
> -    }
> +    handler->epoll_events = xrealloc(handler->epoll_events,
> +                                     dpif_handler_channels_count(handler) *
> +                                     sizeof *handler->epoll_events);
>  
>      memset(&event, 0, sizeof event);
>      event.events = EPOLLIN | EPOLLEXCLUSIVE;
>      event.data.u32 = port_idx;
>  
> -    for (i = 0; i < dpif->n_handlers; i++) {
> -        struct dpif_handler *handler = &dpif->handlers[i];
> -
>  #ifndef _WIN32
> -        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> -                      &event) < 0) {
> -            error = errno;
> -            goto error;
> -        }
> -#endif
> -    }
> -    dpif->channels[port_idx].sock = sock;
> -    dpif->channels[port_idx].last_poll = LLONG_MIN;
> -
> -    return 0;
> -
> -error:
> -#ifndef _WIN32
> -    while (i--) {
> -        epoll_ctl(dpif->handlers[i].epoll_fd, EPOLL_CTL_DEL,
> -                  nl_sock_fd(sock), NULL);
> +    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> +                  &event) < 0) {
> +        hmap_remove(&handler->channels, &channel->hmap_node);
> +        free(channel);
> +        id_pool_free_id(dpif->port_ids, port_idx);
> +        return errno;
>      }
>  #endif
> -    dpif->channels[port_idx].sock = NULL;
>  
> -    return error;
> +    id_pool_add(dpif->port_ids, port_idx);
> +    return 0;
>  }
>  
> -static void
> -vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
> +static bool
> +vport_del_channel(struct dpif_netlink *dpif, odp_port_t port_no)
>  {
> -    uint32_t port_idx = odp_to_u32(port_no);
>      size_t i;
>  
> -    if (!dpif->handlers || port_idx >= dpif->uc_array_size
> -        || !dpif->channels[port_idx].sock) {
> -        return;
> -    }
> -
>      for (i = 0; i < dpif->n_handlers; i++) {
> -        struct dpif_handler *handler = &dpif->handlers[i];
> -#ifndef _WIN32
> -        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> -                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
> -#endif
> -        handler->event_offset = handler->n_events = 0;
> +        struct dpif_channel *channel;
> +
> +        channel = dpif_handler_channel_find(&dpif->handlers[i],
> +                                            odp_to_u32(port_no));
> +        if (channel) {
> +            dpif_channel_del(dpif, channel);
> +            return true;
> +        }
>      }
> -#ifndef _WIN32
> -    nl_sock_destroy(dpif->channels[port_idx].sock);
> -#endif
> -    dpif->channels[port_idx].sock = NULL;
> +
> +    return false;
>  }
>  
>  static void
>  destroy_all_channels(struct dpif_netlink *dpif)
>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>  {
> -    unsigned int i;
> +    size_t i;
>  
>      if (!dpif->handlers) {
>          return;
>      }
>  
> -    for (i = 0; i < dpif->uc_array_size; i++ ) {
> -        struct dpif_netlink_vport vport_request;
> -        uint32_t upcall_pids = 0;
> -
> -        if (!dpif->channels[i].sock) {
> -            continue;
> -        }
> -
> -        /* Turn off upcalls. */
> -        dpif_netlink_vport_init(&vport_request);
> -        vport_request.cmd = OVS_VPORT_CMD_SET;
> -        vport_request.dp_ifindex = dpif->dp_ifindex;
> -        vport_request.port_no = u32_to_odp(i);
> -        vport_request.n_upcall_pids = 1;
> -        vport_request.upcall_pids = &upcall_pids;
> -        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
> -
> -        vport_del_channels(dpif, u32_to_odp(i));
> -    }
> -
>      for (i = 0; i < dpif->n_handlers; i++) {
>          struct dpif_handler *handler = &dpif->handlers[i];
> +        struct dpif_channel *channel, *next;
> +
> +        HMAP_FOR_EACH_SAFE (channel, next, hmap_node, &handler->channels) {
> +            struct dpif_netlink_vport vport_request;
> +            uint32_t upcall_pids = 0;

Empty line, please.

> +            /* Turn off upcalls. */
> +            dpif_netlink_vport_init(&vport_request);
> +            vport_request.cmd = OVS_VPORT_CMD_SET;
> +            vport_request.dp_ifindex = dpif->dp_ifindex;
> +            vport_request.port_no = u32_to_odp(channel->port_idx);
> +            vport_request.n_upcall_pids = 1;
> +            vport_request.upcall_pids = &upcall_pids;
> +            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
> +            ovs_assert(channel != NULL);
> +            dpif_channel_del(dpif, channel);
> +        }
>  
>          dpif_netlink_handler_uninit(handler);
>          free(handler->epoll_events);
>      }
> -    free(dpif->channels);
> +
>      free(dpif->handlers);
>      dpif->handlers = NULL;
> -    dpif->channels = NULL;
>      dpif->n_handlers = 0;
> -    dpif->uc_array_size = 0;
>  }
>  
>  static void
> @@ -618,9 +646,11 @@ dpif_netlink_close(struct dpif *dpif_)
>  
>      fat_rwlock_wrlock(&dpif->upcall_lock);
>      destroy_all_channels(dpif);
> +    id_pool_destroy(dpif->port_ids);
>      fat_rwlock_unlock(&dpif->upcall_lock);
>  
>      fat_rwlock_destroy(&dpif->upcall_lock);
> +
>      free(dpif);
>  }
>  
> @@ -1003,7 +1033,9 @@ dpif_netlink_port_del__(struct dpif_netlink *dpif, odp_port_t port_no)
>  #endif
>      error = dpif_netlink_vport_transact(&vport, NULL, NULL);
>  
> -    vport_del_channels(dpif, port_no);
> +    if (!vport_del_channel(dpif, port_no)) {

If packets receiving disabled with dpif_recv_set(), there will be no channels
and this check will fail.  And if it we will return here we will leave tunnel
port in a system which is not a good thing.

And we will also leak 'dpif_port' here.

> +        return EINVAL;
> +    }
>  
>      if (!error && !ovs_tunnels_out_of_tree) {
>          error = dpif_netlink_rtnl_port_destroy(dpif_port.name, dpif_port.type);
> @@ -1084,23 +1116,39 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
>                              odp_port_t port_no)
>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>  {
> +    struct dpif_channel *min_channel = NULL, *found_channel = NULL;
>      uint32_t port_idx = odp_to_u32(port_no);
>      uint32_t pid = 0;
> +    size_t i;
> +
> +    /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
> +     * channel, since it is not heavily loaded. */
> +    uint32_t idx = port_no == ODPP_NONE ? 0 : port_idx;
> +
> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
> +        return 0;
> +    }
> +
> +    if (dpif->handlers) {
> +        for (i = 0; i < dpif->n_handlers; i++) {
> +            if (dpif_handler_channel_find(&dpif->handlers[i], idx)) {
> +                found_channel = dpif_handler_channel_find(&dpif->handlers[i],
> +                                                          idx);

Looking up twice seems inefficient.  Same for the 0 case below.

> +                break;
> +            }
>  
> -    if (dpif->handlers && dpif->uc_array_size > 0) {
> -        /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
> -         * channel, since it is not heavily loaded. */
> -        uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
> -
> -        /* Needs to check in case the socket pointer is changed in between
> -         * the holding of upcall_lock.  A known case happens when the main
> -         * thread deletes the vport while the handler thread is handling
> -         * the upcall from that port. */
> -        if (dpif->channels[idx].sock) {
> -            pid = nl_sock_pid(dpif->channels[idx].sock);
> +            if (dpif_handler_channel_find(&dpif->handlers[i], 0)) {
> +                min_channel = dpif_handler_channel_find(&dpif->handlers[i], 0);
> +            }
>          }
>      }
>  
> +    if (found_channel) {
> +        pid = nl_sock_pid(found_channel->sock);
> +    } else if (min_channel) {
> +        pid = nl_sock_pid(min_channel->sock);

This might be a question for Aaron mostly.

Could you, please, elaborate why we have a behavior change here?
Old code just returnd 0 as a pid, but after this change we will return
pid of a channel of port 0.

> +    }
> +
>      return pid;
>  }
>  
> @@ -2342,12 +2390,14 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
>  static void
>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>  {
> +    hmap_destroy(&handler->channels);
>      vport_delete_sock_pool(handler);
>  }
>  
>  static int
>  dpif_netlink_handler_init(struct dpif_handler *handler)
>  {
> +    hmap_init(&handler->channels);
>      return vport_create_sock_pool(handler);
>  }
>  #else
> @@ -2355,6 +2405,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>  static int
>  dpif_netlink_handler_init(struct dpif_handler *handler)
>  {
> +    hmap_init(&handler->channels);
>      handler->epoll_fd = epoll_create(10);
>      return handler->epoll_fd < 0 ? errno : 0;
>  }
> @@ -2362,6 +2413,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>  static void
>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>  {
> +    hmap_destroy(&handler->channels);
>      close(handler->epoll_fd);
>  }
>  #endif
> @@ -2374,9 +2426,7 @@ static int
>  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>  {
> -    unsigned long int *keep_channels;
>      struct dpif_netlink_vport vport;
> -    size_t keep_channels_nbits;
>      struct nl_dump dump;
>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>      struct ofpbuf buf;
> @@ -2412,22 +2462,16 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>  
>      for (i = 0; i < n_handlers; i++) {
>          struct dpif_handler *handler = &dpif->handlers[i];
> -

Please, keep.

>          handler->event_offset = handler->n_events = 0;
>      }
>  
> -    keep_channels_nbits = dpif->uc_array_size;
> -    keep_channels = bitmap_allocate(keep_channels_nbits);
> -
>      ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
>      dpif_netlink_port_dump_start__(dpif, &dump);
>      while (!dpif_netlink_port_dump_next__(dpif, &dump, &vport, &buf)) {
> -        uint32_t port_no = odp_to_u32(vport.port_no);
>          uint32_t upcall_pid;
>          int error;
>  
> -        if (port_no >= dpif->uc_array_size
> -            || !vport_get_pid(dpif, port_no, &upcall_pid)) {
> +        if (!vport_get_pid(dpif, vport.port_no, &upcall_pid)) {
>              struct nl_sock *sock;
>              error = create_nl_sock(dpif, &sock);
>  
> @@ -2475,25 +2519,15 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>              }
>          }
>  
> -        if (port_no < keep_channels_nbits) {
> -            bitmap_set1(keep_channels, port_no);
> -        }
>          continue;
>  
>      error:
> -        vport_del_channels(dpif, vport.port_no);
> +        vport_del_channel(dpif, vport.port_no);
>      }
> +
>      nl_dump_done(&dump);
>      ofpbuf_uninit(&buf);
>  
> -    /* Discard any saved channels that we didn't reuse. */
> -    for (i = 0; i < keep_channels_nbits; i++) {
> -        if (!bitmap_is_set(keep_channels, i)) {
> -            vport_del_channels(dpif, u32_to_odp(i));
> -        }
> -    }
> -    free(keep_channels);
> -
>      return retval;
>  }
>  
> @@ -2714,6 +2748,8 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>  {
>      struct dpif_handler *handler;
> +    size_t n_channels;
> +
>      int read_tries = 0;
>  
>      if (!dpif->handlers || handler_id >= dpif->n_handlers) {
> @@ -2721,14 +2757,15 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>      }
>  
>      handler = &dpif->handlers[handler_id];
> -    if (handler->event_offset >= handler->n_events) {
> +    n_channels = dpif_handler_channels_count(handler);
> +    if (handler->event_offset >= handler->n_events && n_channels) {
>          int retval;
>  
>          handler->event_offset = handler->n_events = 0;
>  
>          do {
>              retval = epoll_wait(handler->epoll_fd, handler->epoll_events,
> -                                dpif->uc_array_size, 0);
> +                                n_channels, 0);
>          } while (retval < 0 && errno == EINTR);
>  
>          if (retval < 0) {
> @@ -2741,7 +2778,10 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>  
>      while (handler->event_offset < handler->n_events) {
>          int idx = handler->epoll_events[handler->event_offset].data.u32;
> -        struct dpif_channel *ch = &dpif->channels[idx];
> +        struct dpif_channel *ch = dpif_handler_channel_find(handler, idx);
> +        if (!ch) {
> +            return EAGAIN;

How can it be that handler received event on port that it doesn't have channel for?
Maybe assertion will be better here?

> +        }
>  
>          handler->event_offset++;
>  
> @@ -2845,12 +2885,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
>      if (dpif->handlers) {
>          size_t i;
>  
> -        if (!dpif->channels[0].sock) {
> -            return;
> -        }
> -        for (i = 0; i < dpif->uc_array_size; i++ ) {
> +        for (i = 0; i < dpif->n_handlers; i++) {
> +            struct dpif_handler *handler = &dpif->handlers[i];
> +            struct dpif_channel *channel;
> +
> +            HMAP_FOR_EACH (channel, hmap_node, &handler->channels) {
> +                nl_sock_drain(channel->sock);
> +            }
>  
> -            nl_sock_drain(dpif->channels[i].sock);
>          }
>      }
>  }
> diff --git a/lib/id-pool.c b/lib/id-pool.c
> index 69910ad08..bef822f6b 100644
> --- a/lib/id-pool.c
> +++ b/lib/id-pool.c
> @@ -93,6 +93,16 @@ id_pool_find(struct id_pool *pool, uint32_t id)
>      return NULL;
>  }
>  
> +bool
> +id_pool_has_id(struct id_pool *pool, uint32_t id)
> +{
> +    if (!id_pool_find(pool, id)) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  void
>  id_pool_add(struct id_pool *pool, uint32_t id)
>  {
> diff --git a/lib/id-pool.h b/lib/id-pool.h
> index 8721f8793..62876e2a5 100644
> --- a/lib/id-pool.h
> +++ b/lib/id-pool.h
> @@ -29,6 +29,7 @@ void id_pool_destroy(struct id_pool *);
>  bool id_pool_alloc_id(struct id_pool *, uint32_t *id);
>  void id_pool_free_id(struct id_pool *, uint32_t id);
>  void id_pool_add(struct id_pool *, uint32_t id);
> +bool id_pool_has_id(struct id_pool *, uint32_t id);
>  
>  /*
>   * ID pool.
>
Flavio Leitner Sept. 17, 2020, 7:23 p.m. UTC | #4
On Thu, Sep 17, 2020 at 08:43:38PM +0200, Ilya Maximets wrote:
> On 9/9/20 7:08 PM, Mark Gray wrote:
> > From: Aaron Conole <aconole@redhat.com>
> > 
> > Currently, the channel handlers are polled globally.  On some
> > systems, this causes a thundering herd issue where multiple
> > handler threads become active, only to do no work and immediately
> > sleep.
> > 
> > The approach here is to push the netlink socket channels to discreet
> > handler threads to process, rather than polling on every thread.
> > This will eliminate the need to wake multiple threads.
> > 
> > To check:
> > 
> >   ip netns add left
> >   ip netns add right
> >   ip link add center-left type veth peer name left0 netns left
> >   ip link add center-right type veth peer name right0 netns right
> >   ip link set center-left up
> >   ip link set center-right up
> >   ip -n left ip link set left0 up
> >   ip -n left ip addr add 172.31.110.10/24 dev left0
> >   ip -n right ip link set right0 up
> >   ip -n right ip addr add 172.31.110.11/24 dev right0
> > 
> >   ovs-vsctl add-br br0
> >   ovs-vsctl add-port br0 center-right
> >   ovs-vsctl add-port br0 center-left
> > 
> >   # in one terminal
> >   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> > 
> >   # in a separate terminal
> >   ip netns exec left arping -I left0 -c 1 172.31.110.11
> > 
> >   # in the perf terminal after exiting
> >   perf script
> > 
> > Look for the number of 'handler' threads which were made active.
> > 
> > Suggested-by: Ben Pfaff <blp@ovn.org>
> > Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
> > Reported-by: David Ahern <dsahern@gmail.com>
> > Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> > Cc: Matteo Croce <technoboy85@gmail.com>
> > Cc: Flavio Leitner <fbl@sysclose.org>
> > Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
> > Signed-off-by: Aaron Conole <aconole@redhat.com>
> > Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
> > ---
> > 
> > 
> > v2: Oops - forgot to commit my whitespace cleanups.
> > v3: one port latency results as per Matteo's comments
> > 
> >     Stock:
> >       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
> >     With Patch:
> >       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
> > v4: Oops - first email did not send
> > v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
> > v6: Split out the AUTHORS patch and added a cover letter as
> >     per Ilya's suggestion.
> >     Fixed 0-day issues.
> > v7: Merged patches as per Flavio's suggestion. This is
> >     no longer a series. Fixed some other small issues.
> > 
> >  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
> >  lib/id-pool.c      |  10 ++
> >  lib/id-pool.h      |   1 +
> >  3 files changed, 197 insertions(+), 144 deletions(-)
> > 
> > diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> > index 7da4fb54d..a74838506 100644
> > --- a/lib/dpif-netlink.c
> > +++ b/lib/dpif-netlink.c
> > @@ -37,6 +37,7 @@
> >  #include "dpif-provider.h"
> >  #include "fat-rwlock.h"
> >  #include "flow.h"
> > +#include "id-pool.h"
> >  #include "netdev-linux.h"
> >  #include "netdev-offload.h"
> >  #include "netdev-provider.h"
> > @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
> >  
> >  /* One of the dpif channels between the kernel and userspace. */
> >  struct dpif_channel {
> > +    struct hmap_node hmap_node;
> 
> This needs a comment and all other new structure fileds too.
> Ex.:
>     struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */
> 
> > +    struct dpif_handler *handler;
> 
> Why do we need this back-pointer?  Maybe it's better to just add 'handler'
> argument to dpif_channel_del() ?
> 
> > +    uint32_t port_idx;
> 
> I see a lot of 'port_idx' here and there, but it's not really an index anymore.
> You removed all the arrays that used these variables as index.
> Can we just use 'port_no' directly here?  I know that there will be conversions
> from odp to u32, but this should be more clear.  If conversions are the issue,
> we could store u32 version of 'port_no' here.
> 
> And, please, add a comment.
> 
> >      struct nl_sock *sock;       /* Netlink socket. */
> >      long long int last_poll;    /* Last time this channel was polled. */
> >  };
> > @@ -183,6 +187,8 @@ struct dpif_handler {
> >      int n_events;                 /* Num events returned by epoll_wait(). */
> >      int event_offset;             /* Offset into 'epoll_events'. */
> >  
> > +    struct hmap channels;         /* map of channels for each port. */
> 
> Comment style: Capital letter.
> 
> > +
> >  #ifdef _WIN32
> >      /* Pool of sockets. */
> >      struct dpif_windows_vport_sock *vport_sock_pool;
> > @@ -201,9 +207,8 @@ struct dpif_netlink {
> >      struct fat_rwlock upcall_lock;
> >      struct dpif_handler *handlers;
> >      uint32_t n_handlers;           /* Num of upcall handlers. */
> > -    struct dpif_channel *channels; /* Array of channels for each port. */
> > -    int uc_array_size;             /* Size of 'handler->channels' and */
> > -                                   /* 'handler->epoll_events'. */
> > +
> > +    struct id_pool *port_ids;      /* Set of added port ids */
> 
> What is the reason of having this pool?  Datapath port numbers are unique
> except for ODPP_NONE, and all the functions you implemented actually
> iterates over all channels anyway to find one with requested port_idx.
> So, the only purpose of it is to avoid searching on incorect port number,
> but this should be a rare case.  Can we just remove it?
> 
> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
> instead as it already has all required infrastructure and doesn't actually
> care about range of values.

Well, now it iterates over handlers looking for a channel that has
a port, so maybe it's better to have ports hashed so we can get the
channel?

fbl

> 
> >  
> >      /* Change notification. */
> >      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
> > @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
> >  #endif
> >  }
> >  
> > +static size_t
> > +dpif_handler_channels_count(const struct dpif_handler *handler)
> > +{
> > +    return hmap_count(&handler->channels);
> > +}
> > +
> > +static struct dpif_channel *
> > +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
> > +{
> > +    struct dpif_channel *channel;
> > +
> > +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
> > +                             &handler->channels) {
> > +        if (channel->port_idx == idx) {
> > +            return channel;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static void
> > +dpif_channel_del(struct dpif_netlink *dpif,
> > +                 struct dpif_channel *channel)
> 
> As suggested above, it might make sense to just pass 'handler' here and
> avoid having back-pointer inside the channel structure.
> 
> > +{
> > +    struct dpif_handler *handler = channel->handler;
> > +
> > +#ifndef _WIN32
> > +    epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> > +              nl_sock_fd(channel->sock), NULL);
> > +    nl_sock_destroy(channel->sock);
> > +#endif
> > +    hmap_remove(&handler->channels, &channel->hmap_node);
> > +    id_pool_free_id(dpif->port_ids, channel->port_idx);
> > +    handler->event_offset = handler->n_events = 0;
> > +    free(channel);
> > +}
> > +
> >  static struct dpif_netlink *
> >  dpif_netlink_cast(const struct dpif *dpif)
> >  {
> > @@ -385,6 +428,7 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif **dpifp)
> >  
> >      dpif->dp_ifindex = dp->dp_ifindex;
> >      dpif->user_features = dp->user_features;
> > +    dpif->port_ids = id_pool_create(0, MAX_PORTS);
> >      *dpifp = &dpif->dpif;
> >  
> >      return 0;
> > @@ -446,167 +490,151 @@ error:
> >  }
> >  #endif /* _WIN32 */
> >  
> > -/* Given the port number 'port_idx', extracts the pid of netlink socket
> > +/* Given the port number 'port_no', extracts the pid of netlink socket
> >   * associated to the port and assigns it to 'upcall_pid'. */
> >  static bool
> > -vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
> > +vport_get_pid(struct dpif_netlink *dpif, odp_port_t port_no,
> >                uint32_t *upcall_pid)
> 
> This function seems redundant as it does almost same thing as
> dpif_netlink_port_get_pid__().  Can we combine them?
> 
> >  {
> > -    /* Since the nl_sock can only be assigned in either all
> > -     * or none "dpif" channels, the following check
> > -     * would suffice. */
> > -    if (!dpif->channels[port_idx].sock) {
> > +    uint32_t port_idx = odp_to_u32(port_no);
> > +    size_t i;
> > +
> > +    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
> > +
> > +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
> >          return false;
> >      }
> > -    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
> >  
> > -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
> > +    /* The 'port_idx' should only be valid for a single handler. */
> > +    for (i = 0; i < dpif->n_handlers; i++) {
> > +        struct dpif_channel *channel;
> > +
> > +        channel = dpif_handler_channel_find(&dpif->handlers[i], port_idx);
> > +        if (channel) {
> > +            *upcall_pid = nl_sock_pid(channel->sock);
> > +            return true;
> > +        }
> > +    }
> >  
> > -    return true;
> > +    return false;
> >  }
> >  
> >  static int
> >  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
> >                    struct nl_sock *sock)
> >  {
> > -    struct epoll_event event;
> >      uint32_t port_idx = odp_to_u32(port_no);
> > +    struct dpif_channel *channel;
> > +    struct dpif_handler *handler;
> > +    struct epoll_event event;
> >      size_t i;
> > -    int error;
> >  
> > -    if (dpif->handlers == NULL) {
> > +    if (dpif->handlers == NULL ||
> > +        id_pool_has_id(dpif->port_ids, port_idx)) {
> >          close_nl_sock(sock);
> > -        return 0;
> > +        return EINVAL;
> >      }
> >  
> > -    /* We assume that the datapath densely chooses port numbers, which can
> > -     * therefore be used as an index into 'channels' and 'epoll_events' of
> > -     * 'dpif'. */
> > -    if (port_idx >= dpif->uc_array_size) {
> > -        uint32_t new_size = port_idx + 1;
> > -
> > -        if (new_size > MAX_PORTS) {
> > -            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
> > -                         dpif_name(&dpif->dpif), port_no);
> > -            return EFBIG;
> > -        }
> > +    if (port_idx >= MAX_PORTS) {
> > +        VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
> > +                     dpif_name(&dpif->dpif), port_no);
> 
> This looks a bit wierd since we're checking port_idx and printing port_no
> instead.  Might be refined if we will just use port_no everywhere.
> 
> > +        return EFBIG;
> > +    }
> >  
> > -        dpif->channels = xrealloc(dpif->channels,
> > -                                  new_size * sizeof *dpif->channels);
> > +    handler = &dpif->handlers[0];
> >  
> > -        for (i = dpif->uc_array_size; i < new_size; i++) {
> > -            dpif->channels[i].sock = NULL;
> > +    for (i = 0; i < dpif->n_handlers; i++) {
> > +        if (dpif_handler_channels_count(&dpif->handlers[i]) <
> > +            dpif_handler_channels_count(handler)) {
> > +            handler = &dpif->handlers[i];
> >          }
> > +    }
> >  
> > -        for (i = 0; i < dpif->n_handlers; i++) {
> > -            struct dpif_handler *handler = &dpif->handlers[i];
> > +    channel = xmalloc(sizeof *channel);
> > +    channel->port_idx = port_idx;
> > +    channel->sock = sock;
> > +    channel->last_poll = LLONG_MIN;
> > +    channel->handler = handler;
> > +    hmap_insert(&handler->channels, &channel->hmap_node,
> > +                hash_int(port_idx, 0));
> >  
> > -            handler->epoll_events = xrealloc(handler->epoll_events,
> > -                new_size * sizeof *handler->epoll_events);
> > -
> > -        }
> > -        dpif->uc_array_size = new_size;
> > -    }
> > +    handler->epoll_events = xrealloc(handler->epoll_events,
> > +                                     dpif_handler_channels_count(handler) *
> > +                                     sizeof *handler->epoll_events);
> >  
> >      memset(&event, 0, sizeof event);
> >      event.events = EPOLLIN | EPOLLEXCLUSIVE;
> >      event.data.u32 = port_idx;
> >  
> > -    for (i = 0; i < dpif->n_handlers; i++) {
> > -        struct dpif_handler *handler = &dpif->handlers[i];
> > -
> >  #ifndef _WIN32
> > -        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> > -                      &event) < 0) {
> > -            error = errno;
> > -            goto error;
> > -        }
> > -#endif
> > -    }
> > -    dpif->channels[port_idx].sock = sock;
> > -    dpif->channels[port_idx].last_poll = LLONG_MIN;
> > -
> > -    return 0;
> > -
> > -error:
> > -#ifndef _WIN32
> > -    while (i--) {
> > -        epoll_ctl(dpif->handlers[i].epoll_fd, EPOLL_CTL_DEL,
> > -                  nl_sock_fd(sock), NULL);
> > +    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> > +                  &event) < 0) {
> > +        hmap_remove(&handler->channels, &channel->hmap_node);
> > +        free(channel);
> > +        id_pool_free_id(dpif->port_ids, port_idx);
> > +        return errno;
> >      }
> >  #endif
> > -    dpif->channels[port_idx].sock = NULL;
> >  
> > -    return error;
> > +    id_pool_add(dpif->port_ids, port_idx);
> > +    return 0;
> >  }
> >  
> > -static void
> > -vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
> > +static bool
> > +vport_del_channel(struct dpif_netlink *dpif, odp_port_t port_no)
> >  {
> > -    uint32_t port_idx = odp_to_u32(port_no);
> >      size_t i;
> >  
> > -    if (!dpif->handlers || port_idx >= dpif->uc_array_size
> > -        || !dpif->channels[port_idx].sock) {
> > -        return;
> > -    }
> > -
> >      for (i = 0; i < dpif->n_handlers; i++) {
> > -        struct dpif_handler *handler = &dpif->handlers[i];
> > -#ifndef _WIN32
> > -        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> > -                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
> > -#endif
> > -        handler->event_offset = handler->n_events = 0;
> > +        struct dpif_channel *channel;
> > +
> > +        channel = dpif_handler_channel_find(&dpif->handlers[i],
> > +                                            odp_to_u32(port_no));
> > +        if (channel) {
> > +            dpif_channel_del(dpif, channel);
> > +            return true;
> > +        }
> >      }
> > -#ifndef _WIN32
> > -    nl_sock_destroy(dpif->channels[port_idx].sock);
> > -#endif
> > -    dpif->channels[port_idx].sock = NULL;
> > +
> > +    return false;
> >  }
> >  
> >  static void
> >  destroy_all_channels(struct dpif_netlink *dpif)
> >      OVS_REQ_WRLOCK(dpif->upcall_lock)
> >  {
> > -    unsigned int i;
> > +    size_t i;
> >  
> >      if (!dpif->handlers) {
> >          return;
> >      }
> >  
> > -    for (i = 0; i < dpif->uc_array_size; i++ ) {
> > -        struct dpif_netlink_vport vport_request;
> > -        uint32_t upcall_pids = 0;
> > -
> > -        if (!dpif->channels[i].sock) {
> > -            continue;
> > -        }
> > -
> > -        /* Turn off upcalls. */
> > -        dpif_netlink_vport_init(&vport_request);
> > -        vport_request.cmd = OVS_VPORT_CMD_SET;
> > -        vport_request.dp_ifindex = dpif->dp_ifindex;
> > -        vport_request.port_no = u32_to_odp(i);
> > -        vport_request.n_upcall_pids = 1;
> > -        vport_request.upcall_pids = &upcall_pids;
> > -        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
> > -
> > -        vport_del_channels(dpif, u32_to_odp(i));
> > -    }
> > -
> >      for (i = 0; i < dpif->n_handlers; i++) {
> >          struct dpif_handler *handler = &dpif->handlers[i];
> > +        struct dpif_channel *channel, *next;
> > +
> > +        HMAP_FOR_EACH_SAFE (channel, next, hmap_node, &handler->channels) {
> > +            struct dpif_netlink_vport vport_request;
> > +            uint32_t upcall_pids = 0;
> 
> Empty line, please.
> 
> > +            /* Turn off upcalls. */
> > +            dpif_netlink_vport_init(&vport_request);
> > +            vport_request.cmd = OVS_VPORT_CMD_SET;
> > +            vport_request.dp_ifindex = dpif->dp_ifindex;
> > +            vport_request.port_no = u32_to_odp(channel->port_idx);
> > +            vport_request.n_upcall_pids = 1;
> > +            vport_request.upcall_pids = &upcall_pids;
> > +            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
> > +            ovs_assert(channel != NULL);
> > +            dpif_channel_del(dpif, channel);
> > +        }
> >  
> >          dpif_netlink_handler_uninit(handler);
> >          free(handler->epoll_events);
> >      }
> > -    free(dpif->channels);
> > +
> >      free(dpif->handlers);
> >      dpif->handlers = NULL;
> > -    dpif->channels = NULL;
> >      dpif->n_handlers = 0;
> > -    dpif->uc_array_size = 0;
> >  }
> >  
> >  static void
> > @@ -618,9 +646,11 @@ dpif_netlink_close(struct dpif *dpif_)
> >  
> >      fat_rwlock_wrlock(&dpif->upcall_lock);
> >      destroy_all_channels(dpif);
> > +    id_pool_destroy(dpif->port_ids);
> >      fat_rwlock_unlock(&dpif->upcall_lock);
> >  
> >      fat_rwlock_destroy(&dpif->upcall_lock);
> > +
> >      free(dpif);
> >  }
> >  
> > @@ -1003,7 +1033,9 @@ dpif_netlink_port_del__(struct dpif_netlink *dpif, odp_port_t port_no)
> >  #endif
> >      error = dpif_netlink_vport_transact(&vport, NULL, NULL);
> >  
> > -    vport_del_channels(dpif, port_no);
> > +    if (!vport_del_channel(dpif, port_no)) {
> 
> If packets receiving disabled with dpif_recv_set(), there will be no channels
> and this check will fail.  And if it we will return here we will leave tunnel
> port in a system which is not a good thing.
> 
> And we will also leak 'dpif_port' here.
> 
> > +        return EINVAL;
> > +    }
> >  
> >      if (!error && !ovs_tunnels_out_of_tree) {
> >          error = dpif_netlink_rtnl_port_destroy(dpif_port.name, dpif_port.type);
> > @@ -1084,23 +1116,39 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
> >                              odp_port_t port_no)
> >      OVS_REQ_RDLOCK(dpif->upcall_lock)
> >  {
> > +    struct dpif_channel *min_channel = NULL, *found_channel = NULL;
> >      uint32_t port_idx = odp_to_u32(port_no);
> >      uint32_t pid = 0;
> > +    size_t i;
> > +
> > +    /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
> > +     * channel, since it is not heavily loaded. */
> > +    uint32_t idx = port_no == ODPP_NONE ? 0 : port_idx;
> > +
> > +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
> > +        return 0;
> > +    }
> > +
> > +    if (dpif->handlers) {
> > +        for (i = 0; i < dpif->n_handlers; i++) {
> > +            if (dpif_handler_channel_find(&dpif->handlers[i], idx)) {
> > +                found_channel = dpif_handler_channel_find(&dpif->handlers[i],
> > +                                                          idx);
> 
> Looking up twice seems inefficient.  Same for the 0 case below.
> 
> > +                break;
> > +            }
> >  
> > -    if (dpif->handlers && dpif->uc_array_size > 0) {
> > -        /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
> > -         * channel, since it is not heavily loaded. */
> > -        uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
> > -
> > -        /* Needs to check in case the socket pointer is changed in between
> > -         * the holding of upcall_lock.  A known case happens when the main
> > -         * thread deletes the vport while the handler thread is handling
> > -         * the upcall from that port. */
> > -        if (dpif->channels[idx].sock) {
> > -            pid = nl_sock_pid(dpif->channels[idx].sock);
> > +            if (dpif_handler_channel_find(&dpif->handlers[i], 0)) {
> > +                min_channel = dpif_handler_channel_find(&dpif->handlers[i], 0);
> > +            }
> >          }
> >      }
> >  
> > +    if (found_channel) {
> > +        pid = nl_sock_pid(found_channel->sock);
> > +    } else if (min_channel) {
> > +        pid = nl_sock_pid(min_channel->sock);
> 
> This might be a question for Aaron mostly.
> 
> Could you, please, elaborate why we have a behavior change here?
> Old code just returnd 0 as a pid, but after this change we will return
> pid of a channel of port 0.
> 
> > +    }
> > +
> >      return pid;
> >  }
> >  
> > @@ -2342,12 +2390,14 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
> >  static void
> >  dpif_netlink_handler_uninit(struct dpif_handler *handler)
> >  {
> > +    hmap_destroy(&handler->channels);
> >      vport_delete_sock_pool(handler);
> >  }
> >  
> >  static int
> >  dpif_netlink_handler_init(struct dpif_handler *handler)
> >  {
> > +    hmap_init(&handler->channels);
> >      return vport_create_sock_pool(handler);
> >  }
> >  #else
> > @@ -2355,6 +2405,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
> >  static int
> >  dpif_netlink_handler_init(struct dpif_handler *handler)
> >  {
> > +    hmap_init(&handler->channels);
> >      handler->epoll_fd = epoll_create(10);
> >      return handler->epoll_fd < 0 ? errno : 0;
> >  }
> > @@ -2362,6 +2413,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
> >  static void
> >  dpif_netlink_handler_uninit(struct dpif_handler *handler)
> >  {
> > +    hmap_destroy(&handler->channels);
> >      close(handler->epoll_fd);
> >  }
> >  #endif
> > @@ -2374,9 +2426,7 @@ static int
> >  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
> >      OVS_REQ_WRLOCK(dpif->upcall_lock)
> >  {
> > -    unsigned long int *keep_channels;
> >      struct dpif_netlink_vport vport;
> > -    size_t keep_channels_nbits;
> >      struct nl_dump dump;
> >      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
> >      struct ofpbuf buf;
> > @@ -2412,22 +2462,16 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
> >  
> >      for (i = 0; i < n_handlers; i++) {
> >          struct dpif_handler *handler = &dpif->handlers[i];
> > -
> 
> Please, keep.
> 
> >          handler->event_offset = handler->n_events = 0;
> >      }
> >  
> > -    keep_channels_nbits = dpif->uc_array_size;
> > -    keep_channels = bitmap_allocate(keep_channels_nbits);
> > -
> >      ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
> >      dpif_netlink_port_dump_start__(dpif, &dump);
> >      while (!dpif_netlink_port_dump_next__(dpif, &dump, &vport, &buf)) {
> > -        uint32_t port_no = odp_to_u32(vport.port_no);
> >          uint32_t upcall_pid;
> >          int error;
> >  
> > -        if (port_no >= dpif->uc_array_size
> > -            || !vport_get_pid(dpif, port_no, &upcall_pid)) {
> > +        if (!vport_get_pid(dpif, vport.port_no, &upcall_pid)) {
> >              struct nl_sock *sock;
> >              error = create_nl_sock(dpif, &sock);
> >  
> > @@ -2475,25 +2519,15 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
> >              }
> >          }
> >  
> > -        if (port_no < keep_channels_nbits) {
> > -            bitmap_set1(keep_channels, port_no);
> > -        }
> >          continue;
> >  
> >      error:
> > -        vport_del_channels(dpif, vport.port_no);
> > +        vport_del_channel(dpif, vport.port_no);
> >      }
> > +
> >      nl_dump_done(&dump);
> >      ofpbuf_uninit(&buf);
> >  
> > -    /* Discard any saved channels that we didn't reuse. */
> > -    for (i = 0; i < keep_channels_nbits; i++) {
> > -        if (!bitmap_is_set(keep_channels, i)) {
> > -            vport_del_channels(dpif, u32_to_odp(i));
> > -        }
> > -    }
> > -    free(keep_channels);
> > -
> >      return retval;
> >  }
> >  
> > @@ -2714,6 +2748,8 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
> >      OVS_REQ_RDLOCK(dpif->upcall_lock)
> >  {
> >      struct dpif_handler *handler;
> > +    size_t n_channels;
> > +
> >      int read_tries = 0;
> >  
> >      if (!dpif->handlers || handler_id >= dpif->n_handlers) {
> > @@ -2721,14 +2757,15 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
> >      }
> >  
> >      handler = &dpif->handlers[handler_id];
> > -    if (handler->event_offset >= handler->n_events) {
> > +    n_channels = dpif_handler_channels_count(handler);
> > +    if (handler->event_offset >= handler->n_events && n_channels) {
> >          int retval;
> >  
> >          handler->event_offset = handler->n_events = 0;
> >  
> >          do {
> >              retval = epoll_wait(handler->epoll_fd, handler->epoll_events,
> > -                                dpif->uc_array_size, 0);
> > +                                n_channels, 0);
> >          } while (retval < 0 && errno == EINTR);
> >  
> >          if (retval < 0) {
> > @@ -2741,7 +2778,10 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
> >  
> >      while (handler->event_offset < handler->n_events) {
> >          int idx = handler->epoll_events[handler->event_offset].data.u32;
> > -        struct dpif_channel *ch = &dpif->channels[idx];
> > +        struct dpif_channel *ch = dpif_handler_channel_find(handler, idx);
> > +        if (!ch) {
> > +            return EAGAIN;
> 
> How can it be that handler received event on port that it doesn't have channel for?
> Maybe assertion will be better here?
> 
> > +        }
> >  
> >          handler->event_offset++;
> >  
> > @@ -2845,12 +2885,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
> >      if (dpif->handlers) {
> >          size_t i;
> >  
> > -        if (!dpif->channels[0].sock) {
> > -            return;
> > -        }
> > -        for (i = 0; i < dpif->uc_array_size; i++ ) {
> > +        for (i = 0; i < dpif->n_handlers; i++) {
> > +            struct dpif_handler *handler = &dpif->handlers[i];
> > +            struct dpif_channel *channel;
> > +
> > +            HMAP_FOR_EACH (channel, hmap_node, &handler->channels) {
> > +                nl_sock_drain(channel->sock);
> > +            }
> >  
> > -            nl_sock_drain(dpif->channels[i].sock);
> >          }
> >      }
> >  }
> > diff --git a/lib/id-pool.c b/lib/id-pool.c
> > index 69910ad08..bef822f6b 100644
> > --- a/lib/id-pool.c
> > +++ b/lib/id-pool.c
> > @@ -93,6 +93,16 @@ id_pool_find(struct id_pool *pool, uint32_t id)
> >      return NULL;
> >  }
> >  
> > +bool
> > +id_pool_has_id(struct id_pool *pool, uint32_t id)
> > +{
> > +    if (!id_pool_find(pool, id)) {
> > +        return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> >  void
> >  id_pool_add(struct id_pool *pool, uint32_t id)
> >  {
> > diff --git a/lib/id-pool.h b/lib/id-pool.h
> > index 8721f8793..62876e2a5 100644
> > --- a/lib/id-pool.h
> > +++ b/lib/id-pool.h
> > @@ -29,6 +29,7 @@ void id_pool_destroy(struct id_pool *);
> >  bool id_pool_alloc_id(struct id_pool *, uint32_t *id);
> >  void id_pool_free_id(struct id_pool *, uint32_t id);
> >  void id_pool_add(struct id_pool *, uint32_t id);
> > +bool id_pool_has_id(struct id_pool *, uint32_t id);
> >  
> >  /*
> >   * ID pool.
> > 
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Aaron Conole Sept. 17, 2020, 7:49 p.m. UTC | #5
Ilya Maximets <i.maximets@ovn.org> writes:

> On 9/9/20 7:08 PM, Mark Gray wrote:
>> From: Aaron Conole <aconole@redhat.com>
>> 
>> Currently, the channel handlers are polled globally.  On some
>> systems, this causes a thundering herd issue where multiple
>> handler threads become active, only to do no work and immediately
>> sleep.
>> 
>> The approach here is to push the netlink socket channels to discreet
>> handler threads to process, rather than polling on every thread.
>> This will eliminate the need to wake multiple threads.
>> 
>> To check:
>> 
>>   ip netns add left
>>   ip netns add right
>>   ip link add center-left type veth peer name left0 netns left
>>   ip link add center-right type veth peer name right0 netns right
>>   ip link set center-left up
>>   ip link set center-right up
>>   ip -n left ip link set left0 up
>>   ip -n left ip addr add 172.31.110.10/24 dev left0
>>   ip -n right ip link set right0 up
>>   ip -n right ip addr add 172.31.110.11/24 dev right0
>> 
>>   ovs-vsctl add-br br0
>>   ovs-vsctl add-port br0 center-right
>>   ovs-vsctl add-port br0 center-left
>> 
>>   # in one terminal
>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>> 
>>   # in a separate terminal
>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>> 
>>   # in the perf terminal after exiting
>>   perf script
>> 
>> Look for the number of 'handler' threads which were made active.
>> 
>> Suggested-by: Ben Pfaff <blp@ovn.org>
>> Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
>> Reported-by: David Ahern <dsahern@gmail.com>
>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
>> Cc: Matteo Croce <technoboy85@gmail.com>
>> Cc: Flavio Leitner <fbl@sysclose.org>
>> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
>> ---
>> 
>> 
>> v2: Oops - forgot to commit my whitespace cleanups.
>> v3: one port latency results as per Matteo's comments
>> 
>>     Stock:
>>       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
>>     With Patch:
>>       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
>> v4: Oops - first email did not send
>> v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
>> v6: Split out the AUTHORS patch and added a cover letter as
>>     per Ilya's suggestion.
>>     Fixed 0-day issues.
>> v7: Merged patches as per Flavio's suggestion. This is
>>     no longer a series. Fixed some other small issues.
>> 
>>  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
>>  lib/id-pool.c      |  10 ++
>>  lib/id-pool.h      |   1 +
>>  3 files changed, 197 insertions(+), 144 deletions(-)
>> 
>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>> index 7da4fb54d..a74838506 100644
>> --- a/lib/dpif-netlink.c
>> +++ b/lib/dpif-netlink.c
>> @@ -37,6 +37,7 @@
>>  #include "dpif-provider.h"
>>  #include "fat-rwlock.h"
>>  #include "flow.h"
>> +#include "id-pool.h"
>>  #include "netdev-linux.h"
>>  #include "netdev-offload.h"
>>  #include "netdev-provider.h"
>> @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
>>  
>>  /* One of the dpif channels between the kernel and userspace. */
>>  struct dpif_channel {
>> +    struct hmap_node hmap_node;
>
> This needs a comment and all other new structure fileds too.
> Ex.:
>     struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */
>
>> +    struct dpif_handler *handler;
>
> Why do we need this back-pointer?  Maybe it's better to just add 'handler'
> argument to dpif_channel_del() ?
>
>> +    uint32_t port_idx;
>
> I see a lot of 'port_idx' here and there, but it's not really an index anymore.
> You removed all the arrays that used these variables as index.
> Can we just use 'port_no' directly here?  I know that there will be conversions
> from odp to u32, but this should be more clear.  If conversions are the issue,
> we could store u32 version of 'port_no' here.
>
> And, please, add a comment.
>
>>      struct nl_sock *sock;       /* Netlink socket. */
>>      long long int last_poll;    /* Last time this channel was polled. */
>>  };
>> @@ -183,6 +187,8 @@ struct dpif_handler {
>>      int n_events;                 /* Num events returned by epoll_wait(). */
>>      int event_offset;             /* Offset into 'epoll_events'. */
>>  
>> +    struct hmap channels;         /* map of channels for each port. */
>
> Comment style: Capital letter.
>
>> +
>>  #ifdef _WIN32
>>      /* Pool of sockets. */
>>      struct dpif_windows_vport_sock *vport_sock_pool;
>> @@ -201,9 +207,8 @@ struct dpif_netlink {
>>      struct fat_rwlock upcall_lock;
>>      struct dpif_handler *handlers;
>>      uint32_t n_handlers;           /* Num of upcall handlers. */
>> -    struct dpif_channel *channels; /* Array of channels for each port. */
>> -    int uc_array_size;             /* Size of 'handler->channels' and */
>> -                                   /* 'handler->epoll_events'. */
>> +
>> +    struct id_pool *port_ids;      /* Set of added port ids */
>
> What is the reason of having this pool?  Datapath port numbers are unique
> except for ODPP_NONE, and all the functions you implemented actually
> iterates over all channels anyway to find one with requested port_idx.
> So, the only purpose of it is to avoid searching on incorect port number,
> but this should be a rare case.  Can we just remove it?
>
> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
> instead as it already has all required infrastructure and doesn't actually
> care about range of values.
>
>>  
>>      /* Change notification. */
>>      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
>> @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
>>  #endif
>>  }
>>  
>> +static size_t
>> +dpif_handler_channels_count(const struct dpif_handler *handler)
>> +{
>> +    return hmap_count(&handler->channels);
>> +}
>> +
>> +static struct dpif_channel *
>> +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
>> +{
>> +    struct dpif_channel *channel;
>> +
>> +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
>> +                             &handler->channels) {
>> +        if (channel->port_idx == idx) {
>> +            return channel;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +static void
>> +dpif_channel_del(struct dpif_netlink *dpif,
>> +                 struct dpif_channel *channel)
>
> As suggested above, it might make sense to just pass 'handler' here and
> avoid having back-pointer inside the channel structure.
>
>> +{
>> +    struct dpif_handler *handler = channel->handler;
>> +
>> +#ifndef _WIN32
>> +    epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
>> +              nl_sock_fd(channel->sock), NULL);
>> +    nl_sock_destroy(channel->sock);
>> +#endif
>> +    hmap_remove(&handler->channels, &channel->hmap_node);
>> +    id_pool_free_id(dpif->port_ids, channel->port_idx);
>> +    handler->event_offset = handler->n_events = 0;
>> +    free(channel);
>> +}
>> +
>>  static struct dpif_netlink *
>>  dpif_netlink_cast(const struct dpif *dpif)
>>  {
>> @@ -385,6 +428,7 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif **dpifp)
>>  
>>      dpif->dp_ifindex = dp->dp_ifindex;
>>      dpif->user_features = dp->user_features;
>> +    dpif->port_ids = id_pool_create(0, MAX_PORTS);
>>      *dpifp = &dpif->dpif;
>>  
>>      return 0;
>> @@ -446,167 +490,151 @@ error:
>>  }
>>  #endif /* _WIN32 */
>>  
>> -/* Given the port number 'port_idx', extracts the pid of netlink socket
>> +/* Given the port number 'port_no', extracts the pid of netlink socket
>>   * associated to the port and assigns it to 'upcall_pid'. */
>>  static bool
>> -vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
>> +vport_get_pid(struct dpif_netlink *dpif, odp_port_t port_no,
>>                uint32_t *upcall_pid)
>
> This function seems redundant as it does almost same thing as
> dpif_netlink_port_get_pid__().  Can we combine them?
>
>>  {
>> -    /* Since the nl_sock can only be assigned in either all
>> -     * or none "dpif" channels, the following check
>> -     * would suffice. */
>> -    if (!dpif->channels[port_idx].sock) {
>> +    uint32_t port_idx = odp_to_u32(port_no);
>> +    size_t i;
>> +
>> +    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>> +
>> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
>>          return false;
>>      }
>> -    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>>  
>> -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
>> +    /* The 'port_idx' should only be valid for a single handler. */
>> +    for (i = 0; i < dpif->n_handlers; i++) {
>> +        struct dpif_channel *channel;
>> +
>> +        channel = dpif_handler_channel_find(&dpif->handlers[i], port_idx);
>> +        if (channel) {
>> +            *upcall_pid = nl_sock_pid(channel->sock);
>> +            return true;
>> +        }
>> +    }
>>  
>> -    return true;
>> +    return false;
>>  }
>>  
>>  static int
>>  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
>>                    struct nl_sock *sock)
>>  {
>> -    struct epoll_event event;
>>      uint32_t port_idx = odp_to_u32(port_no);
>> +    struct dpif_channel *channel;
>> +    struct dpif_handler *handler;
>> +    struct epoll_event event;
>>      size_t i;
>> -    int error;
>>  
>> -    if (dpif->handlers == NULL) {
>> +    if (dpif->handlers == NULL ||
>> +        id_pool_has_id(dpif->port_ids, port_idx)) {
>>          close_nl_sock(sock);
>> -        return 0;
>> +        return EINVAL;
>>      }
>>  
>> -    /* We assume that the datapath densely chooses port numbers, which can
>> -     * therefore be used as an index into 'channels' and 'epoll_events' of
>> -     * 'dpif'. */
>> -    if (port_idx >= dpif->uc_array_size) {
>> -        uint32_t new_size = port_idx + 1;
>> -
>> -        if (new_size > MAX_PORTS) {
>> -            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
>> -                         dpif_name(&dpif->dpif), port_no);
>> -            return EFBIG;
>> -        }
>> +    if (port_idx >= MAX_PORTS) {
>> +        VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
>> +                     dpif_name(&dpif->dpif), port_no);
>
> This looks a bit wierd since we're checking port_idx and printing port_no
> instead.  Might be refined if we will just use port_no everywhere.
>
>> +        return EFBIG;
>> +    }
>>  
>> -        dpif->channels = xrealloc(dpif->channels,
>> -                                  new_size * sizeof *dpif->channels);
>> +    handler = &dpif->handlers[0];
>>  
>> -        for (i = dpif->uc_array_size; i < new_size; i++) {
>> -            dpif->channels[i].sock = NULL;
>> +    for (i = 0; i < dpif->n_handlers; i++) {
>> +        if (dpif_handler_channels_count(&dpif->handlers[i]) <
>> +            dpif_handler_channels_count(handler)) {
>> +            handler = &dpif->handlers[i];
>>          }
>> +    }
>>  
>> -        for (i = 0; i < dpif->n_handlers; i++) {
>> -            struct dpif_handler *handler = &dpif->handlers[i];
>> +    channel = xmalloc(sizeof *channel);
>> +    channel->port_idx = port_idx;
>> +    channel->sock = sock;
>> +    channel->last_poll = LLONG_MIN;
>> +    channel->handler = handler;
>> +    hmap_insert(&handler->channels, &channel->hmap_node,
>> +                hash_int(port_idx, 0));
>>  
>> -            handler->epoll_events = xrealloc(handler->epoll_events,
>> -                new_size * sizeof *handler->epoll_events);
>> -
>> -        }
>> -        dpif->uc_array_size = new_size;
>> -    }
>> +    handler->epoll_events = xrealloc(handler->epoll_events,
>> +                                     dpif_handler_channels_count(handler) *
>> +                                     sizeof *handler->epoll_events);
>>  
>>      memset(&event, 0, sizeof event);
>>      event.events = EPOLLIN | EPOLLEXCLUSIVE;
>>      event.data.u32 = port_idx;
>>  
>> -    for (i = 0; i < dpif->n_handlers; i++) {
>> -        struct dpif_handler *handler = &dpif->handlers[i];
>> -
>>  #ifndef _WIN32
>> -        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
>> -                      &event) < 0) {
>> -            error = errno;
>> -            goto error;
>> -        }
>> -#endif
>> -    }
>> -    dpif->channels[port_idx].sock = sock;
>> -    dpif->channels[port_idx].last_poll = LLONG_MIN;
>> -
>> -    return 0;
>> -
>> -error:
>> -#ifndef _WIN32
>> -    while (i--) {
>> -        epoll_ctl(dpif->handlers[i].epoll_fd, EPOLL_CTL_DEL,
>> -                  nl_sock_fd(sock), NULL);
>> +    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
>> +                  &event) < 0) {
>> +        hmap_remove(&handler->channels, &channel->hmap_node);
>> +        free(channel);
>> +        id_pool_free_id(dpif->port_ids, port_idx);
>> +        return errno;
>>      }
>>  #endif
>> -    dpif->channels[port_idx].sock = NULL;
>>  
>> -    return error;
>> +    id_pool_add(dpif->port_ids, port_idx);
>> +    return 0;
>>  }
>>  
>> -static void
>> -vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
>> +static bool
>> +vport_del_channel(struct dpif_netlink *dpif, odp_port_t port_no)
>>  {
>> -    uint32_t port_idx = odp_to_u32(port_no);
>>      size_t i;
>>  
>> -    if (!dpif->handlers || port_idx >= dpif->uc_array_size
>> -        || !dpif->channels[port_idx].sock) {
>> -        return;
>> -    }
>> -
>>      for (i = 0; i < dpif->n_handlers; i++) {
>> -        struct dpif_handler *handler = &dpif->handlers[i];
>> -#ifndef _WIN32
>> -        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
>> -                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
>> -#endif
>> -        handler->event_offset = handler->n_events = 0;
>> +        struct dpif_channel *channel;
>> +
>> +        channel = dpif_handler_channel_find(&dpif->handlers[i],
>> +                                            odp_to_u32(port_no));
>> +        if (channel) {
>> +            dpif_channel_del(dpif, channel);
>> +            return true;
>> +        }
>>      }
>> -#ifndef _WIN32
>> -    nl_sock_destroy(dpif->channels[port_idx].sock);
>> -#endif
>> -    dpif->channels[port_idx].sock = NULL;
>> +
>> +    return false;
>>  }
>>  
>>  static void
>>  destroy_all_channels(struct dpif_netlink *dpif)
>>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>>  {
>> -    unsigned int i;
>> +    size_t i;
>>  
>>      if (!dpif->handlers) {
>>          return;
>>      }
>>  
>> -    for (i = 0; i < dpif->uc_array_size; i++ ) {
>> -        struct dpif_netlink_vport vport_request;
>> -        uint32_t upcall_pids = 0;
>> -
>> -        if (!dpif->channels[i].sock) {
>> -            continue;
>> -        }
>> -
>> -        /* Turn off upcalls. */
>> -        dpif_netlink_vport_init(&vport_request);
>> -        vport_request.cmd = OVS_VPORT_CMD_SET;
>> -        vport_request.dp_ifindex = dpif->dp_ifindex;
>> -        vport_request.port_no = u32_to_odp(i);
>> -        vport_request.n_upcall_pids = 1;
>> -        vport_request.upcall_pids = &upcall_pids;
>> -        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
>> -
>> -        vport_del_channels(dpif, u32_to_odp(i));
>> -    }
>> -
>>      for (i = 0; i < dpif->n_handlers; i++) {
>>          struct dpif_handler *handler = &dpif->handlers[i];
>> +        struct dpif_channel *channel, *next;
>> +
>> +        HMAP_FOR_EACH_SAFE (channel, next, hmap_node, &handler->channels) {
>> +            struct dpif_netlink_vport vport_request;
>> +            uint32_t upcall_pids = 0;
>
> Empty line, please.
>
>> +            /* Turn off upcalls. */
>> +            dpif_netlink_vport_init(&vport_request);
>> +            vport_request.cmd = OVS_VPORT_CMD_SET;
>> +            vport_request.dp_ifindex = dpif->dp_ifindex;
>> +            vport_request.port_no = u32_to_odp(channel->port_idx);
>> +            vport_request.n_upcall_pids = 1;
>> +            vport_request.upcall_pids = &upcall_pids;
>> +            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
>> +            ovs_assert(channel != NULL);
>> +            dpif_channel_del(dpif, channel);
>> +        }
>>  
>>          dpif_netlink_handler_uninit(handler);
>>          free(handler->epoll_events);
>>      }
>> -    free(dpif->channels);
>> +
>>      free(dpif->handlers);
>>      dpif->handlers = NULL;
>> -    dpif->channels = NULL;
>>      dpif->n_handlers = 0;
>> -    dpif->uc_array_size = 0;
>>  }
>>  
>>  static void
>> @@ -618,9 +646,11 @@ dpif_netlink_close(struct dpif *dpif_)
>>  
>>      fat_rwlock_wrlock(&dpif->upcall_lock);
>>      destroy_all_channels(dpif);
>> +    id_pool_destroy(dpif->port_ids);
>>      fat_rwlock_unlock(&dpif->upcall_lock);
>>  
>>      fat_rwlock_destroy(&dpif->upcall_lock);
>> +
>>      free(dpif);
>>  }
>>  
>> @@ -1003,7 +1033,9 @@ dpif_netlink_port_del__(struct dpif_netlink *dpif, odp_port_t port_no)
>>  #endif
>>      error = dpif_netlink_vport_transact(&vport, NULL, NULL);
>>  
>> -    vport_del_channels(dpif, port_no);
>> +    if (!vport_del_channel(dpif, port_no)) {
>
> If packets receiving disabled with dpif_recv_set(), there will be no channels
> and this check will fail.  And if it we will return here we will leave tunnel
> port in a system which is not a good thing.
>
> And we will also leak 'dpif_port' here.

Agreed.

>> +        return EINVAL;
>> +    }
>>  
>>      if (!error && !ovs_tunnels_out_of_tree) {
>>          error = dpif_netlink_rtnl_port_destroy(dpif_port.name, dpif_port.type);
>> @@ -1084,23 +1116,39 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
>>                              odp_port_t port_no)
>>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>>  {
>> +    struct dpif_channel *min_channel = NULL, *found_channel = NULL;
>>      uint32_t port_idx = odp_to_u32(port_no);
>>      uint32_t pid = 0;
>> +    size_t i;
>> +
>> +    /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
>> +     * channel, since it is not heavily loaded. */
>> +    uint32_t idx = port_no == ODPP_NONE ? 0 : port_idx;
>> +
>> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
>> +        return 0;
>> +    }
>> +
>> +    if (dpif->handlers) {
>> +        for (i = 0; i < dpif->n_handlers; i++) {
>> +            if (dpif_handler_channel_find(&dpif->handlers[i], idx)) {
>> +                found_channel = dpif_handler_channel_find(&dpif->handlers[i],
>> +                                                          idx);
>
> Looking up twice seems inefficient.  Same for the 0 case below.
>
>> +                break;
>> +            }
>>  
>> -    if (dpif->handlers && dpif->uc_array_size > 0) {
>> -        /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
>> -         * channel, since it is not heavily loaded. */
>> -        uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
>> -
>> -        /* Needs to check in case the socket pointer is changed in between
>> -         * the holding of upcall_lock.  A known case happens when the main
>> -         * thread deletes the vport while the handler thread is handling
>> -         * the upcall from that port. */
>> -        if (dpif->channels[idx].sock) {
>> -            pid = nl_sock_pid(dpif->channels[idx].sock);
>> +            if (dpif_handler_channel_find(&dpif->handlers[i], 0)) {
>> +                min_channel = dpif_handler_channel_find(&dpif->handlers[i], 0);
>> +            }
>>          }
>>      }
>>  
>> +    if (found_channel) {
>> +        pid = nl_sock_pid(found_channel->sock);
>> +    } else if (min_channel) {
>> +        pid = nl_sock_pid(min_channel->sock);
>
> This might be a question for Aaron mostly.
>
> Could you, please, elaborate why we have a behavior change here?
> Old code just returnd 0 as a pid, but after this change we will return
> pid of a channel of port 0.

I see what you mean - previously, we only got pid for index 0 when the
port index was out of range.  Now we will get it even when the port
index is in range, but not found.  That shouldn't be - it is my mistake.

>> +    }
>> +
>>      return pid;
>>  }
>>  
>> @@ -2342,12 +2390,14 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
>>  static void
>>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>>  {
>> +    hmap_destroy(&handler->channels);
>>      vport_delete_sock_pool(handler);
>>  }
>>  
>>  static int
>>  dpif_netlink_handler_init(struct dpif_handler *handler)
>>  {
>> +    hmap_init(&handler->channels);
>>      return vport_create_sock_pool(handler);
>>  }
>>  #else
>> @@ -2355,6 +2405,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>>  static int
>>  dpif_netlink_handler_init(struct dpif_handler *handler)
>>  {
>> +    hmap_init(&handler->channels);
>>      handler->epoll_fd = epoll_create(10);
>>      return handler->epoll_fd < 0 ? errno : 0;
>>  }
>> @@ -2362,6 +2413,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>>  static void
>>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>>  {
>> +    hmap_destroy(&handler->channels);
>>      close(handler->epoll_fd);
>>  }
>>  #endif
>> @@ -2374,9 +2426,7 @@ static int
>>  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>>  {
>> -    unsigned long int *keep_channels;
>>      struct dpif_netlink_vport vport;
>> -    size_t keep_channels_nbits;
>>      struct nl_dump dump;
>>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>>      struct ofpbuf buf;
>> @@ -2412,22 +2462,16 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>  
>>      for (i = 0; i < n_handlers; i++) {
>>          struct dpif_handler *handler = &dpif->handlers[i];
>> -
>
> Please, keep.
>
>>          handler->event_offset = handler->n_events = 0;
>>      }
>>  
>> -    keep_channels_nbits = dpif->uc_array_size;
>> -    keep_channels = bitmap_allocate(keep_channels_nbits);
>> -
>>      ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
>>      dpif_netlink_port_dump_start__(dpif, &dump);
>>      while (!dpif_netlink_port_dump_next__(dpif, &dump, &vport, &buf)) {
>> -        uint32_t port_no = odp_to_u32(vport.port_no);
>>          uint32_t upcall_pid;
>>          int error;
>>  
>> -        if (port_no >= dpif->uc_array_size
>> -            || !vport_get_pid(dpif, port_no, &upcall_pid)) {
>> +        if (!vport_get_pid(dpif, vport.port_no, &upcall_pid)) {
>>              struct nl_sock *sock;
>>              error = create_nl_sock(dpif, &sock);
>>  
>> @@ -2475,25 +2519,15 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>              }
>>          }
>>  
>> -        if (port_no < keep_channels_nbits) {
>> -            bitmap_set1(keep_channels, port_no);
>> -        }
>>          continue;
>>  
>>      error:
>> -        vport_del_channels(dpif, vport.port_no);
>> +        vport_del_channel(dpif, vport.port_no);
>>      }
>> +
>>      nl_dump_done(&dump);
>>      ofpbuf_uninit(&buf);
>>  
>> -    /* Discard any saved channels that we didn't reuse. */
>> -    for (i = 0; i < keep_channels_nbits; i++) {
>> -        if (!bitmap_is_set(keep_channels, i)) {
>> -            vport_del_channels(dpif, u32_to_odp(i));
>> -        }
>> -    }
>> -    free(keep_channels);
>> -
>>      return retval;
>>  }
>>  
>> @@ -2714,6 +2748,8 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>>  {
>>      struct dpif_handler *handler;
>> +    size_t n_channels;
>> +
>>      int read_tries = 0;
>>  
>>      if (!dpif->handlers || handler_id >= dpif->n_handlers) {
>> @@ -2721,14 +2757,15 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>      }
>>  
>>      handler = &dpif->handlers[handler_id];
>> -    if (handler->event_offset >= handler->n_events) {
>> +    n_channels = dpif_handler_channels_count(handler);
>> +    if (handler->event_offset >= handler->n_events && n_channels) {
>>          int retval;
>>  
>>          handler->event_offset = handler->n_events = 0;
>>  
>>          do {
>>              retval = epoll_wait(handler->epoll_fd, handler->epoll_events,
>> -                                dpif->uc_array_size, 0);
>> +                                n_channels, 0);
>>          } while (retval < 0 && errno == EINTR);
>>  
>>          if (retval < 0) {
>> @@ -2741,7 +2778,10 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>  
>>      while (handler->event_offset < handler->n_events) {
>>          int idx = handler->epoll_events[handler->event_offset].data.u32;
>> -        struct dpif_channel *ch = &dpif->channels[idx];
>> +        struct dpif_channel *ch = dpif_handler_channel_find(handler, idx);
>> +        if (!ch) {
>> +            return EAGAIN;
>
> How can it be that handler received event on port that it doesn't have channel for?
> Maybe assertion will be better here?

Hrrm... I guess the only case I can imagine is if during upcall
processing we remove a port?  I don't know if it can actually happen
though.  I agree, maybe an assert / log would be better.

>> +        }
>>  
>>          handler->event_offset++;
>>  
>> @@ -2845,12 +2885,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
>>      if (dpif->handlers) {
>>          size_t i;
>>  
>> -        if (!dpif->channels[0].sock) {
>> -            return;
>> -        }
>> -        for (i = 0; i < dpif->uc_array_size; i++ ) {
>> +        for (i = 0; i < dpif->n_handlers; i++) {
>> +            struct dpif_handler *handler = &dpif->handlers[i];
>> +            struct dpif_channel *channel;
>> +
>> +            HMAP_FOR_EACH (channel, hmap_node, &handler->channels) {
>> +                nl_sock_drain(channel->sock);
>> +            }
>>  
>> -            nl_sock_drain(dpif->channels[i].sock);
>>          }
>>      }
>>  }
>> diff --git a/lib/id-pool.c b/lib/id-pool.c
>> index 69910ad08..bef822f6b 100644
>> --- a/lib/id-pool.c
>> +++ b/lib/id-pool.c
>> @@ -93,6 +93,16 @@ id_pool_find(struct id_pool *pool, uint32_t id)
>>      return NULL;
>>  }
>>  
>> +bool
>> +id_pool_has_id(struct id_pool *pool, uint32_t id)
>> +{
>> +    if (!id_pool_find(pool, id)) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>  void
>>  id_pool_add(struct id_pool *pool, uint32_t id)
>>  {
>> diff --git a/lib/id-pool.h b/lib/id-pool.h
>> index 8721f8793..62876e2a5 100644
>> --- a/lib/id-pool.h
>> +++ b/lib/id-pool.h
>> @@ -29,6 +29,7 @@ void id_pool_destroy(struct id_pool *);
>>  bool id_pool_alloc_id(struct id_pool *, uint32_t *id);
>>  void id_pool_free_id(struct id_pool *, uint32_t id);
>>  void id_pool_add(struct id_pool *, uint32_t id);
>> +bool id_pool_has_id(struct id_pool *, uint32_t id);
>>  
>>  /*
>>   * ID pool.
>>
Mark Gray Sept. 18, 2020, 2:33 p.m. UTC | #6
On 17/09/2020 20:23, Flavio Leitner wrote:
> On Thu, Sep 17, 2020 at 08:43:38PM +0200, Ilya Maximets wrote:
>> On 9/9/20 7:08 PM, Mark Gray wrote:
>>> From: Aaron Conole <aconole@redhat.com>

>> What is the reason of having this pool?  Datapath port numbers are unique
>> except for ODPP_NONE, and all the functions you implemented actually
>> iterates over all channels anyway to find one with requested port_idx.
>> So, the only purpose of it is to avoid searching on incorect port number,
>> but this should be a rare case.  Can we just remove it?
>>
>> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
>> instead as it already has all required infrastructure and doesn't actually
>> care about range of values.
> 
> Well, now it iterates over handlers looking for a channel that has
> a port, so maybe it's better to have ports hashed so we can get the
> channel?

So each channel will be in two maps, one global one for fast searches
and to check for duplicates and a per-handler one?
Mark Gray Sept. 18, 2020, 4:31 p.m. UTC | #7
On 17/09/2020 20:49, Aaron Conole wrote:
> Ilya Maximets <i.maximets@ovn.org> writes:
> 
>> On 9/9/20 7:08 PM, Mark Gray wrote:
>>> From: Aaron Conole <aconole@redhat.com>
>>>
>>> Currently, the channel handlers are polled globally.  On some
>>> systems, this causes a thundering herd issue where multiple
>>> handler threads become active, only to do no work and immediately
>>> sleep.
>>>
>>> The approach here is to push the netlink socket channels to discreet
>>> handler threads to process, rather than polling on every thread.
>>> This will eliminate the need to wake multiple threads.
>>>
>>> To check:
>>>
>>>   ip netns add left
>>>   ip netns add right
>>>   ip link add center-left type veth peer name left0 netns left
>>>   ip link add center-right type veth peer name right0 netns right
>>>   ip link set center-left up
>>>   ip link set center-right up
>>>   ip -n left ip link set left0 up
>>>   ip -n left ip addr add 172.31.110.10/24 dev left0
>>>   ip -n right ip link set right0 up
>>>   ip -n right ip addr add 172.31.110.11/24 dev right0
>>>
>>>   ovs-vsctl add-br br0
>>>   ovs-vsctl add-port br0 center-right
>>>   ovs-vsctl add-port br0 center-left
>>>
>>>   # in one terminal
>>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>>>
>>>   # in a separate terminal
>>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>>>
>>>   # in the perf terminal after exiting
>>>   perf script
>>>
>>> Look for the number of 'handler' threads which were made active.
>>>
>>> Suggested-by: Ben Pfaff <blp@ovn.org>
>>> Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
>>> Reported-by: David Ahern <dsahern@gmail.com>
>>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
>>> Cc: Matteo Croce <technoboy85@gmail.com>
>>> Cc: Flavio Leitner <fbl@sysclose.org>
>>> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
>>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
>>> ---
>>>
>>>
>>> v2: Oops - forgot to commit my whitespace cleanups.
>>> v3: one port latency results as per Matteo's comments
>>>
>>>     Stock:
>>>       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
>>>     With Patch:
>>>       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
>>> v4: Oops - first email did not send
>>> v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
>>> v6: Split out the AUTHORS patch and added a cover letter as
>>>     per Ilya's suggestion.
>>>     Fixed 0-day issues.
>>> v7: Merged patches as per Flavio's suggestion. This is
>>>     no longer a series. Fixed some other small issues.
>>>
>>>  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
>>>  lib/id-pool.c      |  10 ++
>>>  lib/id-pool.h      |   1 +
>>>  3 files changed, 197 insertions(+), 144 deletions(-)
>>>
>>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>>> index 7da4fb54d..a74838506 100644
>>> --- a/lib/dpif-netlink.c
>>> +++ b/lib/dpif-netlink.c
>>> @@ -37,6 +37,7 @@
>>>  #include "dpif-provider.h"
>>>  #include "fat-rwlock.h"
>>>  #include "flow.h"
>>> +#include "id-pool.h"
>>>  #include "netdev-linux.h"
>>>  #include "netdev-offload.h"
>>>  #include "netdev-provider.h"
>>> @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
>>>  
>>>  /* One of the dpif channels between the kernel and userspace. */
>>>  struct dpif_channel {
>>> +    struct hmap_node hmap_node;
>>
>> This needs a comment and all other new structure fileds too.
>> Ex.:
>>     struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */
>>
>>> +    struct dpif_handler *handler;

Ok, updated

>>
>> Why do we need this back-pointer?  Maybe it's better to just add 'handler'
>> argument to dpif_channel_del() ?

I have kept this as it is required due to the rework that Flavio has
suggested in another thread.

>>
>>> +    uint32_t port_idx;
>>
>> I see a lot of 'port_idx' here and there, but it's not really an index anymore.
>> You removed all the arrays that used these variables as index.
>> Can we just use 'port_no' directly here?  I know that there will be conversions
>> from odp to u32, but this should be more clear.  If conversions are the issue,
>> we could store u32 version of 'port_no' here.

I think this is a good idea. 'port_idx' is littered around this file in
various places and it caused me an issue it a previous revision because
the naming convention was not consistent. I will just remove all
references to it.

>>
>> And, please, add a comment.
>>
>>>      struct nl_sock *sock;       /* Netlink socket. */
>>>      long long int last_poll;    /* Last time this channel was polled. */
>>>  };
>>> @@ -183,6 +187,8 @@ struct dpif_handler {
>>>      int n_events;                 /* Num events returned by epoll_wait(). */
>>>      int event_offset;             /* Offset into 'epoll_events'. */
>>>  
>>> +    struct hmap channels;         /* map of channels for each port. */
>>
>> Comment style: Capital letter.

Ok

>>
>>> +
>>>  #ifdef _WIN32
>>>      /* Pool of sockets. */
>>>      struct dpif_windows_vport_sock *vport_sock_pool;
>>> @@ -201,9 +207,8 @@ struct dpif_netlink {
>>>      struct fat_rwlock upcall_lock;
>>>      struct dpif_handler *handlers;
>>>      uint32_t n_handlers;           /* Num of upcall handlers. */
>>> -    struct dpif_channel *channels; /* Array of channels for each port. */
>>> -    int uc_array_size;             /* Size of 'handler->channels' and */
>>> -                                   /* 'handler->epoll_events'. */
>>> +
>>> +    struct id_pool *port_ids;      /* Set of added port ids */
>>
>> What is the reason of having this pool?  Datapath port numbers are unique
>> except for ODPP_NONE, and all the functions you implemented actually
>> iterates over all channels anyway to find one with requested port_idx.
>> So, the only purpose of it is to avoid searching on incorect port number,
>> but this should be a rare case.  Can we just remove it?
>>
>> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
>> instead as it already has all required infrastructure and doesn't actually
>> care about range of values.

I have replaced this with a hmap of all the channels as suggested in
another thread by Flavio.

>>
>>>  
>>>      /* Change notification. */
>>>      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
>>> @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
>>>  #endif
>>>  }
>>>  
>>> +static size_t
>>> +dpif_handler_channels_count(const struct dpif_handler *handler)
>>> +{
>>> +    return hmap_count(&handler->channels);
>>> +}
>>> +
>>> +static struct dpif_channel *
>>> +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
>>> +{
>>> +    struct dpif_channel *channel;
>>> +
>>> +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
>>> +                             &handler->channels) {
>>> +        if (channel->port_idx == idx) {
>>> +            return channel;
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static void
>>> +dpif_channel_del(struct dpif_netlink *dpif,
>>> +                 struct dpif_channel *channel)
>>
>> As suggested above, it might make sense to just pass 'handler' here and
>> avoid having back-pointer inside the channel structure.

I have left this like this as I have reworked based on Flavio's suggestion.

>>
>>> +{
>>> +    struct dpif_handler *handler = channel->handler;
>>> +
>>> +#ifndef _WIN32
>>> +    epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
>>> +              nl_sock_fd(channel->sock), NULL);
>>> +    nl_sock_destroy(channel->sock);
>>> +#endif
>>> +    hmap_remove(&handler->channels, &channel->hmap_node);
>>> +    id_pool_free_id(dpif->port_ids, channel->port_idx);
>>> +    handler->event_offset = handler->n_events = 0;
>>> +    free(channel);
>>> +}
>>> +
>>>  static struct dpif_netlink *
>>>  dpif_netlink_cast(const struct dpif *dpif)
>>>  {
>>> @@ -385,6 +428,7 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif **dpifp)
>>>  
>>>      dpif->dp_ifindex = dp->dp_ifindex;
>>>      dpif->user_features = dp->user_features;
>>> +    dpif->port_ids = id_pool_create(0, MAX_PORTS);
>>>      *dpifp = &dpif->dpif;
>>>  
>>>      return 0;
>>> @@ -446,167 +490,151 @@ error:
>>>  }
>>>  #endif /* _WIN32 */
>>>  
>>> -/* Given the port number 'port_idx', extracts the pid of netlink socket
>>> +/* Given the port number 'port_no', extracts the pid of netlink socket
>>>   * associated to the port and assigns it to 'upcall_pid'. */
>>>  static bool
>>> -vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
>>> +vport_get_pid(struct dpif_netlink *dpif, odp_port_t port_no,
>>>                uint32_t *upcall_pid)
>>
>> This function seems redundant as it does almost same thing as
>> dpif_netlink_port_get_pid__().  Can we combine them?

Yes, I guess it could be combined.

>>
>>>  {
>>> -    /* Since the nl_sock can only be assigned in either all
>>> -     * or none "dpif" channels, the following check
>>> -     * would suffice. */
>>> -    if (!dpif->channels[port_idx].sock) {
>>> +    uint32_t port_idx = odp_to_u32(port_no);
>>> +    size_t i;
>>> +
>>> +    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>>> +
>>> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
>>>          return false;
>>>      }
>>> -    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>>>  
>>> -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
>>> +    /* The 'port_idx' should only be valid for a single handler. */
>>> +    for (i = 0; i < dpif->n_handlers; i++) {
>>> +        struct dpif_channel *channel;
>>> +
>>> +        channel = dpif_handler_channel_find(&dpif->handlers[i], port_idx);
>>> +        if (channel) {
>>> +            *upcall_pid = nl_sock_pid(channel->sock);
>>> +            return true;
>>> +        }
>>> +    }
>>>  
>>> -    return true;
>>> +    return false;
>>>  }
>>>  
>>>  static int
>>>  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
>>>                    struct nl_sock *sock)
>>>  {
>>> -    struct epoll_event event;
>>>      uint32_t port_idx = odp_to_u32(port_no);
>>> +    struct dpif_channel *channel;
>>> +    struct dpif_handler *handler;
>>> +    struct epoll_event event;
>>>      size_t i;
>>> -    int error;
>>>  
>>> -    if (dpif->handlers == NULL) {
>>> +    if (dpif->handlers == NULL ||
>>> +        id_pool_has_id(dpif->port_ids, port_idx)) {
>>>          close_nl_sock(sock);
>>> -        return 0;
>>> +        return EINVAL;
>>>      }
>>>  
>>> -    /* We assume that the datapath densely chooses port numbers, which can
>>> -     * therefore be used as an index into 'channels' and 'epoll_events' of
>>> -     * 'dpif'. */
>>> -    if (port_idx >= dpif->uc_array_size) {
>>> -        uint32_t new_size = port_idx + 1;
>>> -
>>> -        if (new_size > MAX_PORTS) {
>>> -            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
>>> -                         dpif_name(&dpif->dpif), port_no);
>>> -            return EFBIG;
>>> -        }
>>> +    if (port_idx >= MAX_PORTS) {
>>> +        VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
>>> +                     dpif_name(&dpif->dpif), port_no);
>>
>> This looks a bit wierd since we're checking port_idx and printing port_no
>> instead.  Might be refined if we will just use port_no everywhere.

Yes, it will be.

>>
>>> +        return EFBIG;
>>> +    }
>>>  
>>> -        dpif->channels = xrealloc(dpif->channels,
>>> -                                  new_size * sizeof *dpif->channels);
>>> +    handler = &dpif->handlers[0];
>>>  
>>> -        for (i = dpif->uc_array_size; i < new_size; i++) {
>>> -            dpif->channels[i].sock = NULL;
>>> +    for (i = 0; i < dpif->n_handlers; i++) {
>>> +        if (dpif_handler_channels_count(&dpif->handlers[i]) <
>>> +            dpif_handler_channels_count(handler)) {
>>> +            handler = &dpif->handlers[i];
>>>          }
>>> +    }
>>>  
>>> -        for (i = 0; i < dpif->n_handlers; i++) {
>>> -            struct dpif_handler *handler = &dpif->handlers[i];
>>> +    channel = xmalloc(sizeof *channel);
>>> +    channel->port_idx = port_idx;
>>> +    channel->sock = sock;
>>> +    channel->last_poll = LLONG_MIN;
>>> +    channel->handler = handler;
>>> +    hmap_insert(&handler->channels, &channel->hmap_node,
>>> +                hash_int(port_idx, 0));
>>>  
>>> -            handler->epoll_events = xrealloc(handler->epoll_events,
>>> -                new_size * sizeof *handler->epoll_events);
>>> -
>>> -        }
>>> -        dpif->uc_array_size = new_size;
>>> -    }
>>> +    handler->epoll_events = xrealloc(handler->epoll_events,
>>> +                                     dpif_handler_channels_count(handler) *
>>> +                                     sizeof *handler->epoll_events);
>>>  
>>>      memset(&event, 0, sizeof event);
>>>      event.events = EPOLLIN | EPOLLEXCLUSIVE;
>>>      event.data.u32 = port_idx;
>>>  
>>> -    for (i = 0; i < dpif->n_handlers; i++) {
>>> -        struct dpif_handler *handler = &dpif->handlers[i];
>>> -
>>>  #ifndef _WIN32
>>> -        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
>>> -                      &event) < 0) {
>>> -            error = errno;
>>> -            goto error;
>>> -        }
>>> -#endif
>>> -    }
>>> -    dpif->channels[port_idx].sock = sock;
>>> -    dpif->channels[port_idx].last_poll = LLONG_MIN;
>>> -
>>> -    return 0;
>>> -
>>> -error:
>>> -#ifndef _WIN32
>>> -    while (i--) {
>>> -        epoll_ctl(dpif->handlers[i].epoll_fd, EPOLL_CTL_DEL,
>>> -                  nl_sock_fd(sock), NULL);
>>> +    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
>>> +                  &event) < 0) {
>>> +        hmap_remove(&handler->channels, &channel->hmap_node);
>>> +        free(channel);
>>> +        id_pool_free_id(dpif->port_ids, port_idx);
>>> +        return errno;
>>>      }
>>>  #endif
>>> -    dpif->channels[port_idx].sock = NULL;
>>>  
>>> -    return error;
>>> +    id_pool_add(dpif->port_ids, port_idx);
>>> +    return 0;
>>>  }
>>>  
>>> -static void
>>> -vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
>>> +static bool
>>> +vport_del_channel(struct dpif_netlink *dpif, odp_port_t port_no)
>>>  {
>>> -    uint32_t port_idx = odp_to_u32(port_no);
>>>      size_t i;
>>>  
>>> -    if (!dpif->handlers || port_idx >= dpif->uc_array_size
>>> -        || !dpif->channels[port_idx].sock) {
>>> -        return;
>>> -    }
>>> -
>>>      for (i = 0; i < dpif->n_handlers; i++) {
>>> -        struct dpif_handler *handler = &dpif->handlers[i];
>>> -#ifndef _WIN32
>>> -        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
>>> -                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
>>> -#endif
>>> -        handler->event_offset = handler->n_events = 0;
>>> +        struct dpif_channel *channel;
>>> +
>>> +        channel = dpif_handler_channel_find(&dpif->handlers[i],
>>> +                                            odp_to_u32(port_no));
>>> +        if (channel) {
>>> +            dpif_channel_del(dpif, channel);
>>> +            return true;
>>> +        }
>>>      }
>>> -#ifndef _WIN32
>>> -    nl_sock_destroy(dpif->channels[port_idx].sock);
>>> -#endif
>>> -    dpif->channels[port_idx].sock = NULL;
>>> +
>>> +    return false;
>>>  }
>>>  
>>>  static void
>>>  destroy_all_channels(struct dpif_netlink *dpif)
>>>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>>>  {
>>> -    unsigned int i;
>>> +    size_t i;
>>>  
>>>      if (!dpif->handlers) {
>>>          return;
>>>      }
>>>  
>>> -    for (i = 0; i < dpif->uc_array_size; i++ ) {
>>> -        struct dpif_netlink_vport vport_request;
>>> -        uint32_t upcall_pids = 0;
>>> -
>>> -        if (!dpif->channels[i].sock) {
>>> -            continue;
>>> -        }
>>> -
>>> -        /* Turn off upcalls. */
>>> -        dpif_netlink_vport_init(&vport_request);
>>> -        vport_request.cmd = OVS_VPORT_CMD_SET;
>>> -        vport_request.dp_ifindex = dpif->dp_ifindex;
>>> -        vport_request.port_no = u32_to_odp(i);
>>> -        vport_request.n_upcall_pids = 1;
>>> -        vport_request.upcall_pids = &upcall_pids;
>>> -        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
>>> -
>>> -        vport_del_channels(dpif, u32_to_odp(i));
>>> -    }
>>> -
>>>      for (i = 0; i < dpif->n_handlers; i++) {
>>>          struct dpif_handler *handler = &dpif->handlers[i];
>>> +        struct dpif_channel *channel, *next;
>>> +
>>> +        HMAP_FOR_EACH_SAFE (channel, next, hmap_node, &handler->channels) {
>>> +            struct dpif_netlink_vport vport_request;
>>> +            uint32_t upcall_pids = 0;
>>
>> Empty line, please.

OK

>>
>>> +            /* Turn off upcalls. */
>>> +            dpif_netlink_vport_init(&vport_request);
>>> +            vport_request.cmd = OVS_VPORT_CMD_SET;
>>> +            vport_request.dp_ifindex = dpif->dp_ifindex;
>>> +            vport_request.port_no = u32_to_odp(channel->port_idx);
>>> +            vport_request.n_upcall_pids = 1;
>>> +            vport_request.upcall_pids = &upcall_pids;
>>> +            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
>>> +            ovs_assert(channel != NULL);
>>> +            dpif_channel_del(dpif, channel);
>>> +        }
>>>  
>>>          dpif_netlink_handler_uninit(handler);
>>>          free(handler->epoll_events);
>>>      }
>>> -    free(dpif->channels);
>>> +
>>>      free(dpif->handlers);
>>>      dpif->handlers = NULL;
>>> -    dpif->channels = NULL;
>>>      dpif->n_handlers = 0;
>>> -    dpif->uc_array_size = 0;
>>>  }
>>>  
>>>  static void
>>> @@ -618,9 +646,11 @@ dpif_netlink_close(struct dpif *dpif_)
>>>  
>>>      fat_rwlock_wrlock(&dpif->upcall_lock);
>>>      destroy_all_channels(dpif);
>>> +    id_pool_destroy(dpif->port_ids);
>>>      fat_rwlock_unlock(&dpif->upcall_lock);
>>>  
>>>      fat_rwlock_destroy(&dpif->upcall_lock);
>>> +
>>>      free(dpif);
>>>  }
>>>  
>>> @@ -1003,7 +1033,9 @@ dpif_netlink_port_del__(struct dpif_netlink *dpif, odp_port_t port_no)
>>>  #endif
>>>      error = dpif_netlink_vport_transact(&vport, NULL, NULL);
>>>  
>>> -    vport_del_channels(dpif, port_no);
>>> +    if (!vport_del_channel(dpif, port_no)) {
>>
>> If packets receiving disabled with dpif_recv_set(), there will be no channels
>> and this check will fail.  And if it we will return here we will leave tunnel
>> port in a system which is not a good thing.
>>
>> And we will also leak 'dpif_port' here.
> 
> Agreed.

Good catch Ilya. I think the rework based on Flavio's comment will
change this but I have taken it into consideration in the rework.

> 
>>> +        return EINVAL;
>>> +    }
>>>  
>>>      if (!error && !ovs_tunnels_out_of_tree) {
>>>          error = dpif_netlink_rtnl_port_destroy(dpif_port.name, dpif_port.type);
>>> @@ -1084,23 +1116,39 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
>>>                              odp_port_t port_no)
>>>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>>>  {
>>> +    struct dpif_channel *min_channel = NULL, *found_channel = NULL;
>>>      uint32_t port_idx = odp_to_u32(port_no);
>>>      uint32_t pid = 0;
>>> +    size_t i;
>>> +
>>> +    /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
>>> +     * channel, since it is not heavily loaded. */
>>> +    uint32_t idx = port_no == ODPP_NONE ? 0 : port_idx;
>>> +
>>> +    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (dpif->handlers) {
>>> +        for (i = 0; i < dpif->n_handlers; i++) {
>>> +            if (dpif_handler_channel_find(&dpif->handlers[i], idx)) {
>>> +                found_channel = dpif_handler_channel_find(&dpif->handlers[i],
>>> +                                                          idx);
>>
>> Looking up twice seems inefficient.  Same for the 0 case below.

Agreed, I have reworked this.

>>
>>> +                break;
>>> +            }
>>>  
>>> -    if (dpif->handlers && dpif->uc_array_size > 0) {
>>> -        /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
>>> -         * channel, since it is not heavily loaded. */
>>> -        uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
>>> -
>>> -        /* Needs to check in case the socket pointer is changed in between
>>> -         * the holding of upcall_lock.  A known case happens when the main
>>> -         * thread deletes the vport while the handler thread is handling
>>> -         * the upcall from that port. */
>>> -        if (dpif->channels[idx].sock) {
>>> -            pid = nl_sock_pid(dpif->channels[idx].sock);
>>> +            if (dpif_handler_channel_find(&dpif->handlers[i], 0)) {
>>> +                min_channel = dpif_handler_channel_find(&dpif->handlers[i], 0);
>>> +            }
>>>          }
>>>      }
>>>  
>>> +    if (found_channel) {
>>> +        pid = nl_sock_pid(found_channel->sock);
>>> +    } else if (min_channel) {
>>> +        pid = nl_sock_pid(min_channel->sock);
>>
>> This might be a question for Aaron mostly.
>>
>> Could you, please, elaborate why we have a behavior change here?
>> Old code just returnd 0 as a pid, but after this change we will return
>> pid of a channel of port 0.
> 
> I see what you mean - previously, we only got pid for index 0 when the
> port index was out of range.  Now we will get it even when the port
> index is in range, but not found.  That shouldn't be - it is my mistake.

I modified this a bit to combine, simplify and I think also make a bit
more efficient. Please give it a review.
> 
>>> +    }
>>> +
>>>      return pid;
>>>  }
>>>  
>>> @@ -2342,12 +2390,14 @@ dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
>>>  static void
>>>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>>>  {
>>> +    hmap_destroy(&handler->channels);
>>>      vport_delete_sock_pool(handler);
>>>  }
>>>  
>>>  static int
>>>  dpif_netlink_handler_init(struct dpif_handler *handler)
>>>  {
>>> +    hmap_init(&handler->channels);
>>>      return vport_create_sock_pool(handler);
>>>  }
>>>  #else
>>> @@ -2355,6 +2405,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>>>  static int
>>>  dpif_netlink_handler_init(struct dpif_handler *handler)
>>>  {
>>> +    hmap_init(&handler->channels);
>>>      handler->epoll_fd = epoll_create(10);
>>>      return handler->epoll_fd < 0 ? errno : 0;
>>>  }
>>> @@ -2362,6 +2413,7 @@ dpif_netlink_handler_init(struct dpif_handler *handler)
>>>  static void
>>>  dpif_netlink_handler_uninit(struct dpif_handler *handler)
>>>  {
>>> +    hmap_destroy(&handler->channels);
>>>      close(handler->epoll_fd);
>>>  }
>>>  #endif
>>> @@ -2374,9 +2426,7 @@ static int
>>>  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>>>  {
>>> -    unsigned long int *keep_channels;
>>>      struct dpif_netlink_vport vport;
>>> -    size_t keep_channels_nbits;
>>>      struct nl_dump dump;
>>>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>>>      struct ofpbuf buf;
>>> @@ -2412,22 +2462,16 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>>  
>>>      for (i = 0; i < n_handlers; i++) {
>>>          struct dpif_handler *handler = &dpif->handlers[i];
>>> -
>>
>> Please, keep.

OK

>>
>>>          handler->event_offset = handler->n_events = 0;
>>>      }
>>>  
>>> -    keep_channels_nbits = dpif->uc_array_size;
>>> -    keep_channels = bitmap_allocate(keep_channels_nbits);
>>> -
>>>      ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
>>>      dpif_netlink_port_dump_start__(dpif, &dump);
>>>      while (!dpif_netlink_port_dump_next__(dpif, &dump, &vport, &buf)) {
>>> -        uint32_t port_no = odp_to_u32(vport.port_no);
>>>          uint32_t upcall_pid;
>>>          int error;
>>>  
>>> -        if (port_no >= dpif->uc_array_size
>>> -            || !vport_get_pid(dpif, port_no, &upcall_pid)) {
>>> +        if (!vport_get_pid(dpif, vport.port_no, &upcall_pid)) {
>>>              struct nl_sock *sock;
>>>              error = create_nl_sock(dpif, &sock);
>>>  
>>> @@ -2475,25 +2519,15 @@ dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
>>>              }
>>>          }
>>>  
>>> -        if (port_no < keep_channels_nbits) {
>>> -            bitmap_set1(keep_channels, port_no);
>>> -        }
>>>          continue;
>>>  
>>>      error:
>>> -        vport_del_channels(dpif, vport.port_no);
>>> +        vport_del_channel(dpif, vport.port_no);
>>>      }
>>> +
>>>      nl_dump_done(&dump);
>>>      ofpbuf_uninit(&buf);
>>>  
>>> -    /* Discard any saved channels that we didn't reuse. */
>>> -    for (i = 0; i < keep_channels_nbits; i++) {
>>> -        if (!bitmap_is_set(keep_channels, i)) {
>>> -            vport_del_channels(dpif, u32_to_odp(i));
>>> -        }
>>> -    }
>>> -    free(keep_channels);
>>> -
>>>      return retval;
>>>  }
>>>  
>>> @@ -2714,6 +2748,8 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>>      OVS_REQ_RDLOCK(dpif->upcall_lock)
>>>  {
>>>      struct dpif_handler *handler;
>>> +    size_t n_channels;
>>> +
>>>      int read_tries = 0;
>>>  
>>>      if (!dpif->handlers || handler_id >= dpif->n_handlers) {
>>> @@ -2721,14 +2757,15 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>>      }
>>>  
>>>      handler = &dpif->handlers[handler_id];
>>> -    if (handler->event_offset >= handler->n_events) {
>>> +    n_channels = dpif_handler_channels_count(handler);
>>> +    if (handler->event_offset >= handler->n_events && n_channels) {
>>>          int retval;
>>>  
>>>          handler->event_offset = handler->n_events = 0;
>>>  
>>>          do {
>>>              retval = epoll_wait(handler->epoll_fd, handler->epoll_events,
>>> -                                dpif->uc_array_size, 0);
>>> +                                n_channels, 0);
>>>          } while (retval < 0 && errno == EINTR);
>>>  
>>>          if (retval < 0) {
>>> @@ -2741,7 +2778,10 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>>>  
>>>      while (handler->event_offset < handler->n_events) {
>>>          int idx = handler->epoll_events[handler->event_offset].data.u32;
>>> -        struct dpif_channel *ch = &dpif->channels[idx];
>>> +        struct dpif_channel *ch = dpif_handler_channel_find(handler, idx);
>>> +        if (!ch) {
>>> +            return EAGAIN;
>>
>> How can it be that handler received event on port that it doesn't have channel for?
>> Maybe assertion will be better here?
> 
> Hrrm... I guess the only case I can imagine is if during upcall
> processing we remove a port?  I don't know if it can actually happen
> though.  I agree, maybe an assert / log would be better.

Agreed, I have added an assert (and another one, based on the rework
Flavio suggested)

> 
>>> +        }
>>>  
>>>          handler->event_offset++;
>>>  
>>> @@ -2845,12 +2885,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
>>>      if (dpif->handlers) {
>>>          size_t i;
>>>  
>>> -        if (!dpif->channels[0].sock) {
>>> -            return;
>>> -        }
>>> -        for (i = 0; i < dpif->uc_array_size; i++ ) {
>>> +        for (i = 0; i < dpif->n_handlers; i++) {
>>> +            struct dpif_handler *handler = &dpif->handlers[i];
>>> +            struct dpif_channel *channel;
>>> +
>>> +            HMAP_FOR_EACH (channel, hmap_node, &handler->channels) {
>>> +                nl_sock_drain(channel->sock);
>>> +            }
>>>  
>>> -            nl_sock_drain(dpif->channels[i].sock);
>>>          }
>>>      }
>>>  }
>>> diff --git a/lib/id-pool.c b/lib/id-pool.c
>>> index 69910ad08..bef822f6b 100644
>>> --- a/lib/id-pool.c
>>> +++ b/lib/id-pool.c
>>> @@ -93,6 +93,16 @@ id_pool_find(struct id_pool *pool, uint32_t id)
>>>      return NULL;
>>>  }
>>>  
>>> +bool
>>> +id_pool_has_id(struct id_pool *pool, uint32_t id)
>>> +{
>>> +    if (!id_pool_find(pool, id)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>>  void
>>>  id_pool_add(struct id_pool *pool, uint32_t id)
>>>  {
>>> diff --git a/lib/id-pool.h b/lib/id-pool.h
>>> index 8721f8793..62876e2a5 100644
>>> --- a/lib/id-pool.h
>>> +++ b/lib/id-pool.h
>>> @@ -29,6 +29,7 @@ void id_pool_destroy(struct id_pool *);
>>>  bool id_pool_alloc_id(struct id_pool *, uint32_t *id);
>>>  void id_pool_free_id(struct id_pool *, uint32_t id);
>>>  void id_pool_add(struct id_pool *, uint32_t id);
>>> +bool id_pool_has_id(struct id_pool *, uint32_t id);
>>>  
>>>  /*
>>>   * ID pool.
>>>
>
Mark Gray Sept. 18, 2020, 4:32 p.m. UTC | #8
On 18/09/2020 15:33, Mark Gray wrote:
> On 17/09/2020 20:23, Flavio Leitner wrote:
>> On Thu, Sep 17, 2020 at 08:43:38PM +0200, Ilya Maximets wrote:
>>> On 9/9/20 7:08 PM, Mark Gray wrote:
>>>> From: Aaron Conole <aconole@redhat.com>
> 
>>> What is the reason of having this pool?  Datapath port numbers are unique
>>> except for ODPP_NONE, and all the functions you implemented actually
>>> iterates over all channels anyway to find one with requested port_idx.
>>> So, the only purpose of it is to avoid searching on incorect port number,
>>> but this should be a rare case.  Can we just remove it?
>>>
>>> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
>>> instead as it already has all required infrastructure and doesn't actually
>>> care about range of values.
>>
>> Well, now it iterates over handlers looking for a channel that has
>> a port, so maybe it's better to have ports hashed so we can get the
>> channel?
> 
> So each channel will be in two maps, one global one for fast searches
> and to check for duplicates and a per-handler one?
> 

I will make this change. It will be a significant rework but I think it
will make a different to the efficiency and clarity of the code.
Flavio Leitner Sept. 18, 2020, 5:47 p.m. UTC | #9
On Fri, Sep 18, 2020 at 05:32:41PM +0100, Mark Gray wrote:
> On 18/09/2020 15:33, Mark Gray wrote:
> > On 17/09/2020 20:23, Flavio Leitner wrote:
> >> On Thu, Sep 17, 2020 at 08:43:38PM +0200, Ilya Maximets wrote:
> >>> On 9/9/20 7:08 PM, Mark Gray wrote:
> >>>> From: Aaron Conole <aconole@redhat.com>
> > 
> >>> What is the reason of having this pool?  Datapath port numbers are unique
> >>> except for ODPP_NONE, and all the functions you implemented actually
> >>> iterates over all channels anyway to find one with requested port_idx.
> >>> So, the only purpose of it is to avoid searching on incorect port number,
> >>> but this should be a rare case.  Can we just remove it?
> >>>
> >>> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
> >>> instead as it already has all required infrastructure and doesn't actually
> >>> care about range of values.
> >>
> >> Well, now it iterates over handlers looking for a channel that has
> >> a port, so maybe it's better to have ports hashed so we can get the
> >> channel?
> > 
> > So each channel will be in two maps, one global one for fast searches
> > and to check for duplicates and a per-handler one?
> > 
> 
> I will make this change. It will be a significant rework but I think it
> will make a different to the efficiency and clarity of the code.

I suggested that based on how the code is using the objects,
but I haven't exercised the idea enough to say if the solution
will look ok in the end.

Thanks,
Flavio Leitner Sept. 18, 2020, 5:53 p.m. UTC | #10
On Fri, Sep 18, 2020 at 05:31:30PM +0100, Mark Gray wrote:
> On 17/09/2020 20:49, Aaron Conole wrote:
> > Ilya Maximets <i.maximets@ovn.org> writes:
> > 
> >> On 9/9/20 7:08 PM, Mark Gray wrote:
> >>> From: Aaron Conole <aconole@redhat.com>
> >>>
> >>> Currently, the channel handlers are polled globally.  On some
> >>> systems, this causes a thundering herd issue where multiple
> >>> handler threads become active, only to do no work and immediately
> >>> sleep.
> >>>
> >>> The approach here is to push the netlink socket channels to discreet
> >>> handler threads to process, rather than polling on every thread.
> >>> This will eliminate the need to wake multiple threads.
> >>>
> >>> To check:
> >>>
> >>>   ip netns add left
> >>>   ip netns add right
> >>>   ip link add center-left type veth peer name left0 netns left
> >>>   ip link add center-right type veth peer name right0 netns right
> >>>   ip link set center-left up
> >>>   ip link set center-right up
> >>>   ip -n left ip link set left0 up
> >>>   ip -n left ip addr add 172.31.110.10/24 dev left0
> >>>   ip -n right ip link set right0 up
> >>>   ip -n right ip addr add 172.31.110.11/24 dev right0
> >>>
> >>>   ovs-vsctl add-br br0
> >>>   ovs-vsctl add-port br0 center-right
> >>>   ovs-vsctl add-port br0 center-left
> >>>
> >>>   # in one terminal
> >>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> >>>
> >>>   # in a separate terminal
> >>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
> >>>
> >>>   # in the perf terminal after exiting
> >>>   perf script
> >>>
> >>> Look for the number of 'handler' threads which were made active.
> >>>
> >>> Suggested-by: Ben Pfaff <blp@ovn.org>
> >>> Co-authored-by: Mark Gray <mark.d.gray@redhat.com>
> >>> Reported-by: David Ahern <dsahern@gmail.com>
> >>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> >>> Cc: Matteo Croce <technoboy85@gmail.com>
> >>> Cc: Flavio Leitner <fbl@sysclose.org>
> >>> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
> >>> Signed-off-by: Aaron Conole <aconole@redhat.com>
> >>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
> >>> ---
> >>>
> >>>
> >>> v2: Oops - forgot to commit my whitespace cleanups.
> >>> v3: one port latency results as per Matteo's comments
> >>>
> >>>     Stock:
> >>>       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
> >>>     With Patch:
> >>>       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
> >>> v4: Oops - first email did not send
> >>> v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
> >>> v6: Split out the AUTHORS patch and added a cover letter as
> >>>     per Ilya's suggestion.
> >>>     Fixed 0-day issues.
> >>> v7: Merged patches as per Flavio's suggestion. This is
> >>>     no longer a series. Fixed some other small issues.
> >>>
> >>>  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
> >>>  lib/id-pool.c      |  10 ++
> >>>  lib/id-pool.h      |   1 +
> >>>  3 files changed, 197 insertions(+), 144 deletions(-)
> >>>
> >>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> >>> index 7da4fb54d..a74838506 100644
> >>> --- a/lib/dpif-netlink.c
> >>> +++ b/lib/dpif-netlink.c
> >>> @@ -37,6 +37,7 @@
> >>>  #include "dpif-provider.h"
> >>>  #include "fat-rwlock.h"
> >>>  #include "flow.h"
> >>> +#include "id-pool.h"
> >>>  #include "netdev-linux.h"
> >>>  #include "netdev-offload.h"
> >>>  #include "netdev-provider.h"
> >>> @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
> >>>  
> >>>  /* One of the dpif channels between the kernel and userspace. */
> >>>  struct dpif_channel {
> >>> +    struct hmap_node hmap_node;
> >>
> >> This needs a comment and all other new structure fileds too.
> >> Ex.:
> >>     struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */
> >>
> >>> +    struct dpif_handler *handler;
> 
> Ok, updated
> 
> >>
> >> Why do we need this back-pointer?  Maybe it's better to just add 'handler'
> >> argument to dpif_channel_del() ?
> 
> I have kept this as it is required due to the rework that Flavio has
> suggested in another thread.
> 
> >>
> >>> +    uint32_t port_idx;
> >>
> >> I see a lot of 'port_idx' here and there, but it's not really an index anymore.
> >> You removed all the arrays that used these variables as index.
> >> Can we just use 'port_no' directly here?  I know that there will be conversions
> >> from odp to u32, but this should be more clear.  If conversions are the issue,
> >> we could store u32 version of 'port_no' here.
> 
> I think this is a good idea. 'port_idx' is littered around this file in
> various places and it caused me an issue it a previous revision because
> the naming convention was not consistent. I will just remove all
> references to it.
> 
> >>
> >> And, please, add a comment.
> >>
> >>>      struct nl_sock *sock;       /* Netlink socket. */
> >>>      long long int last_poll;    /* Last time this channel was polled. */
> >>>  };
> >>> @@ -183,6 +187,8 @@ struct dpif_handler {
> >>>      int n_events;                 /* Num events returned by epoll_wait(). */
> >>>      int event_offset;             /* Offset into 'epoll_events'. */
> >>>  
> >>> +    struct hmap channels;         /* map of channels for each port. */
> >>
> >> Comment style: Capital letter.
> 
> Ok
> 
> >>
> >>> +
> >>>  #ifdef _WIN32
> >>>      /* Pool of sockets. */
> >>>      struct dpif_windows_vport_sock *vport_sock_pool;
> >>> @@ -201,9 +207,8 @@ struct dpif_netlink {
> >>>      struct fat_rwlock upcall_lock;
> >>>      struct dpif_handler *handlers;
> >>>      uint32_t n_handlers;           /* Num of upcall handlers. */
> >>> -    struct dpif_channel *channels; /* Array of channels for each port. */
> >>> -    int uc_array_size;             /* Size of 'handler->channels' and */
> >>> -                                   /* 'handler->epoll_events'. */
> >>> +
> >>> +    struct id_pool *port_ids;      /* Set of added port ids */
> >>
> >> What is the reason of having this pool?  Datapath port numbers are unique
> >> except for ODPP_NONE, and all the functions you implemented actually
> >> iterates over all channels anyway to find one with requested port_idx.
> >> So, the only purpose of it is to avoid searching on incorect port number,
> >> but this should be a rare case.  Can we just remove it?
> >>
> >> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
> >> instead as it already has all required infrastructure and doesn't actually
> >> care about range of values.
> 
> I have replaced this with a hmap of all the channels as suggested in
> another thread by Flavio.

I think Ilya is referring to id_pool, not the hmap used for the
channels.


> >>>      /* Change notification. */
> >>>      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
> >>> @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
> >>>  #endif
> >>>  }
> >>>  
> >>> +static size_t
> >>> +dpif_handler_channels_count(const struct dpif_handler *handler)
> >>> +{
> >>> +    return hmap_count(&handler->channels);
> >>> +}
> >>> +
> >>> +static struct dpif_channel *
> >>> +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
> >>> +{
> >>> +    struct dpif_channel *channel;
> >>> +
> >>> +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
> >>> +                             &handler->channels) {
> >>> +        if (channel->port_idx == idx) {
> >>> +            return channel;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return NULL;
> >>> +}
> >>> +
> >>> +static void
> >>> +dpif_channel_del(struct dpif_netlink *dpif,
> >>> +                 struct dpif_channel *channel)
> >>
> >> As suggested above, it might make sense to just pass 'handler' here and
> >> avoid having back-pointer inside the channel structure.
> 
> I have left this like this as I have reworked based on Flavio's suggestion.

Before the back-pointer we had multiple iterations over the hmap.
The back-pointer helped to avoid that, but now it seems the code
was simplified enough that the back-pointer might not be needed
at all.

fbl
Mark Gray Sept. 21, 2020, 8:55 a.m. UTC | #11
On 18/09/2020 18:53, Flavio Leitner wrote:
> On Fri, Sep 18, 2020 at 05:31:30PM +0100, Mark Gray wrote:
>> On 17/09/2020 20:49, Aaron Conole wrote:
>>> Ilya Maximets <i.maximets@ovn.org> writes:

>>>> What is the reason of having this pool?  Datapath port numbers are unique
>>>> except for ODPP_NONE, and all the functions you implemented actually
>>>> iterates over all channels anyway to find one with requested port_idx.
>>>> So, the only purpose of it is to avoid searching on incorect port number,
>>>> but this should be a rare case.  Can we just remove it?
>>>>
>>>> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
>>>> instead as it already has all required infrastructure and doesn't actually
>>>> care about range of values.
>>
>> I have replaced this with a hmap of all the channels as suggested in
>> another thread by Flavio.
> 
> I think Ilya is referring to id_pool, not the hmap used for the
> channels.
> 
> 

Yes, I have replaced the id_pool. I have sent out a v9.
diff mbox series

Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 7da4fb54d..a74838506 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -37,6 +37,7 @@ 
 #include "dpif-provider.h"
 #include "fat-rwlock.h"
 #include "flow.h"
+#include "id-pool.h"
 #include "netdev-linux.h"
 #include "netdev-offload.h"
 #include "netdev-provider.h"
@@ -160,6 +161,9 @@  static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
 
 /* One of the dpif channels between the kernel and userspace. */
 struct dpif_channel {
+    struct hmap_node hmap_node;
+    struct dpif_handler *handler;
+    uint32_t port_idx;
     struct nl_sock *sock;       /* Netlink socket. */
     long long int last_poll;    /* Last time this channel was polled. */
 };
@@ -183,6 +187,8 @@  struct dpif_handler {
     int n_events;                 /* Num events returned by epoll_wait(). */
     int event_offset;             /* Offset into 'epoll_events'. */
 
+    struct hmap channels;         /* map of channels for each port. */
+
 #ifdef _WIN32
     /* Pool of sockets. */
     struct dpif_windows_vport_sock *vport_sock_pool;
@@ -201,9 +207,8 @@  struct dpif_netlink {
     struct fat_rwlock upcall_lock;
     struct dpif_handler *handlers;
     uint32_t n_handlers;           /* Num of upcall handlers. */
-    struct dpif_channel *channels; /* Array of channels for each port. */
-    int uc_array_size;             /* Size of 'handler->channels' and */
-                                   /* 'handler->epoll_events'. */
+
+    struct id_pool *port_ids;      /* Set of added port ids */
 
     /* Change notification. */
     struct nl_sock *port_notifier; /* vport multicast group subscriber. */
@@ -287,6 +292,44 @@  close_nl_sock(struct nl_sock *sock)
 #endif
 }
 
+static size_t
+dpif_handler_channels_count(const struct dpif_handler *handler)
+{
+    return hmap_count(&handler->channels);
+}
+
+static struct dpif_channel *
+dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
+{
+    struct dpif_channel *channel;
+
+    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
+                             &handler->channels) {
+        if (channel->port_idx == idx) {
+            return channel;
+        }
+    }
+
+    return NULL;
+}
+
+static void
+dpif_channel_del(struct dpif_netlink *dpif,
+                 struct dpif_channel *channel)
+{
+    struct dpif_handler *handler = channel->handler;
+
+#ifndef _WIN32
+    epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
+              nl_sock_fd(channel->sock), NULL);
+    nl_sock_destroy(channel->sock);
+#endif
+    hmap_remove(&handler->channels, &channel->hmap_node);
+    id_pool_free_id(dpif->port_ids, channel->port_idx);
+    handler->event_offset = handler->n_events = 0;
+    free(channel);
+}
+
 static struct dpif_netlink *
 dpif_netlink_cast(const struct dpif *dpif)
 {
@@ -385,6 +428,7 @@  open_dpif(const struct dpif_netlink_dp *dp, struct dpif **dpifp)
 
     dpif->dp_ifindex = dp->dp_ifindex;
     dpif->user_features = dp->user_features;
+    dpif->port_ids = id_pool_create(0, MAX_PORTS);
     *dpifp = &dpif->dpif;
 
     return 0;
@@ -446,167 +490,151 @@  error:
 }
 #endif /* _WIN32 */
 
-/* Given the port number 'port_idx', extracts the pid of netlink socket
+/* Given the port number 'port_no', extracts the pid of netlink socket
  * associated to the port and assigns it to 'upcall_pid'. */
 static bool
-vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
+vport_get_pid(struct dpif_netlink *dpif, odp_port_t port_no,
               uint32_t *upcall_pid)
 {
-    /* Since the nl_sock can only be assigned in either all
-     * or none "dpif" channels, the following check
-     * would suffice. */
-    if (!dpif->channels[port_idx].sock) {
+    uint32_t port_idx = odp_to_u32(port_no);
+    size_t i;
+
+    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
+
+    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
         return false;
     }
-    ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
 
-    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
+    /* The 'port_idx' should only be valid for a single handler. */
+    for (i = 0; i < dpif->n_handlers; i++) {
+        struct dpif_channel *channel;
+
+        channel = dpif_handler_channel_find(&dpif->handlers[i], port_idx);
+        if (channel) {
+            *upcall_pid = nl_sock_pid(channel->sock);
+            return true;
+        }
+    }
 
-    return true;
+    return false;
 }
 
 static int
 vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
                   struct nl_sock *sock)
 {
-    struct epoll_event event;
     uint32_t port_idx = odp_to_u32(port_no);
+    struct dpif_channel *channel;
+    struct dpif_handler *handler;
+    struct epoll_event event;
     size_t i;
-    int error;
 
-    if (dpif->handlers == NULL) {
+    if (dpif->handlers == NULL ||
+        id_pool_has_id(dpif->port_ids, port_idx)) {
         close_nl_sock(sock);
-        return 0;
+        return EINVAL;
     }
 
-    /* We assume that the datapath densely chooses port numbers, which can
-     * therefore be used as an index into 'channels' and 'epoll_events' of
-     * 'dpif'. */
-    if (port_idx >= dpif->uc_array_size) {
-        uint32_t new_size = port_idx + 1;
-
-        if (new_size > MAX_PORTS) {
-            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
-                         dpif_name(&dpif->dpif), port_no);
-            return EFBIG;
-        }
+    if (port_idx >= MAX_PORTS) {
+        VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
+                     dpif_name(&dpif->dpif), port_no);
+        return EFBIG;
+    }
 
-        dpif->channels = xrealloc(dpif->channels,
-                                  new_size * sizeof *dpif->channels);
+    handler = &dpif->handlers[0];
 
-        for (i = dpif->uc_array_size; i < new_size; i++) {
-            dpif->channels[i].sock = NULL;
+    for (i = 0; i < dpif->n_handlers; i++) {
+        if (dpif_handler_channels_count(&dpif->handlers[i]) <
+            dpif_handler_channels_count(handler)) {
+            handler = &dpif->handlers[i];
         }
+    }
 
-        for (i = 0; i < dpif->n_handlers; i++) {
-            struct dpif_handler *handler = &dpif->handlers[i];
+    channel = xmalloc(sizeof *channel);
+    channel->port_idx = port_idx;
+    channel->sock = sock;
+    channel->last_poll = LLONG_MIN;
+    channel->handler = handler;
+    hmap_insert(&handler->channels, &channel->hmap_node,
+                hash_int(port_idx, 0));
 
-            handler->epoll_events = xrealloc(handler->epoll_events,
-                new_size * sizeof *handler->epoll_events);
-
-        }
-        dpif->uc_array_size = new_size;
-    }
+    handler->epoll_events = xrealloc(handler->epoll_events,
+                                     dpif_handler_channels_count(handler) *
+                                     sizeof *handler->epoll_events);
 
     memset(&event, 0, sizeof event);
     event.events = EPOLLIN | EPOLLEXCLUSIVE;
     event.data.u32 = port_idx;
 
-    for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-
 #ifndef _WIN32
-        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
-                      &event) < 0) {
-            error = errno;
-            goto error;
-        }
-#endif
-    }
-    dpif->channels[port_idx].sock = sock;
-    dpif->channels[port_idx].last_poll = LLONG_MIN;
-
-    return 0;
-
-error:
-#ifndef _WIN32
-    while (i--) {
-        epoll_ctl(dpif->handlers[i].epoll_fd, EPOLL_CTL_DEL,
-                  nl_sock_fd(sock), NULL);
+    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
+                  &event) < 0) {
+        hmap_remove(&handler->channels, &channel->hmap_node);
+        free(channel);
+        id_pool_free_id(dpif->port_ids, port_idx);
+        return errno;
     }
 #endif
-    dpif->channels[port_idx].sock = NULL;
 
-    return error;
+    id_pool_add(dpif->port_ids, port_idx);
+    return 0;
 }
 
-static void
-vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
+static bool
+vport_del_channel(struct dpif_netlink *dpif, odp_port_t port_no)
 {
-    uint32_t port_idx = odp_to_u32(port_no);
     size_t i;
 
-    if (!dpif->handlers || port_idx >= dpif->uc_array_size
-        || !dpif->channels[port_idx].sock) {
-        return;
-    }
-
     for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-#ifndef _WIN32
-        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
-                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
-#endif
-        handler->event_offset = handler->n_events = 0;
+        struct dpif_channel *channel;
+
+        channel = dpif_handler_channel_find(&dpif->handlers[i],
+                                            odp_to_u32(port_no));
+        if (channel) {
+            dpif_channel_del(dpif, channel);
+            return true;
+        }
     }
-#ifndef _WIN32
-    nl_sock_destroy(dpif->channels[port_idx].sock);
-#endif
-    dpif->channels[port_idx].sock = NULL;
+
+    return false;
 }
 
 static void
 destroy_all_channels(struct dpif_netlink *dpif)
     OVS_REQ_WRLOCK(dpif->upcall_lock)
 {
-    unsigned int i;
+    size_t i;
 
     if (!dpif->handlers) {
         return;
     }
 
-    for (i = 0; i < dpif->uc_array_size; i++ ) {
-        struct dpif_netlink_vport vport_request;
-        uint32_t upcall_pids = 0;
-
-        if (!dpif->channels[i].sock) {
-            continue;
-        }
-
-        /* Turn off upcalls. */
-        dpif_netlink_vport_init(&vport_request);
-        vport_request.cmd = OVS_VPORT_CMD_SET;
-        vport_request.dp_ifindex = dpif->dp_ifindex;
-        vport_request.port_no = u32_to_odp(i);
-        vport_request.n_upcall_pids = 1;
-        vport_request.upcall_pids = &upcall_pids;
-        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
-
-        vport_del_channels(dpif, u32_to_odp(i));
-    }
-
     for (i = 0; i < dpif->n_handlers; i++) {
         struct dpif_handler *handler = &dpif->handlers[i];
+        struct dpif_channel *channel, *next;
+
+        HMAP_FOR_EACH_SAFE (channel, next, hmap_node, &handler->channels) {
+            struct dpif_netlink_vport vport_request;
+            uint32_t upcall_pids = 0;
+            /* Turn off upcalls. */
+            dpif_netlink_vport_init(&vport_request);
+            vport_request.cmd = OVS_VPORT_CMD_SET;
+            vport_request.dp_ifindex = dpif->dp_ifindex;
+            vport_request.port_no = u32_to_odp(channel->port_idx);
+            vport_request.n_upcall_pids = 1;
+            vport_request.upcall_pids = &upcall_pids;
+            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
+            ovs_assert(channel != NULL);
+            dpif_channel_del(dpif, channel);
+        }
 
         dpif_netlink_handler_uninit(handler);
         free(handler->epoll_events);
     }
-    free(dpif->channels);
+
     free(dpif->handlers);
     dpif->handlers = NULL;
-    dpif->channels = NULL;
     dpif->n_handlers = 0;
-    dpif->uc_array_size = 0;
 }
 
 static void
@@ -618,9 +646,11 @@  dpif_netlink_close(struct dpif *dpif_)
 
     fat_rwlock_wrlock(&dpif->upcall_lock);
     destroy_all_channels(dpif);
+    id_pool_destroy(dpif->port_ids);
     fat_rwlock_unlock(&dpif->upcall_lock);
 
     fat_rwlock_destroy(&dpif->upcall_lock);
+
     free(dpif);
 }
 
@@ -1003,7 +1033,9 @@  dpif_netlink_port_del__(struct dpif_netlink *dpif, odp_port_t port_no)
 #endif
     error = dpif_netlink_vport_transact(&vport, NULL, NULL);
 
-    vport_del_channels(dpif, port_no);
+    if (!vport_del_channel(dpif, port_no)) {
+        return EINVAL;
+    }
 
     if (!error && !ovs_tunnels_out_of_tree) {
         error = dpif_netlink_rtnl_port_destroy(dpif_port.name, dpif_port.type);
@@ -1084,23 +1116,39 @@  dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
                             odp_port_t port_no)
     OVS_REQ_RDLOCK(dpif->upcall_lock)
 {
+    struct dpif_channel *min_channel = NULL, *found_channel = NULL;
     uint32_t port_idx = odp_to_u32(port_no);
     uint32_t pid = 0;
+    size_t i;
+
+    /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
+     * channel, since it is not heavily loaded. */
+    uint32_t idx = port_no == ODPP_NONE ? 0 : port_idx;
+
+    if (!id_pool_has_id(dpif->port_ids, port_idx)) {
+        return 0;
+    }
+
+    if (dpif->handlers) {
+        for (i = 0; i < dpif->n_handlers; i++) {
+            if (dpif_handler_channel_find(&dpif->handlers[i], idx)) {
+                found_channel = dpif_handler_channel_find(&dpif->handlers[i],
+                                                          idx);
+                break;
+            }
 
-    if (dpif->handlers && dpif->uc_array_size > 0) {
-        /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
-         * channel, since it is not heavily loaded. */
-        uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
-
-        /* Needs to check in case the socket pointer is changed in between
-         * the holding of upcall_lock.  A known case happens when the main
-         * thread deletes the vport while the handler thread is handling
-         * the upcall from that port. */
-        if (dpif->channels[idx].sock) {
-            pid = nl_sock_pid(dpif->channels[idx].sock);
+            if (dpif_handler_channel_find(&dpif->handlers[i], 0)) {
+                min_channel = dpif_handler_channel_find(&dpif->handlers[i], 0);
+            }
         }
     }
 
+    if (found_channel) {
+        pid = nl_sock_pid(found_channel->sock);
+    } else if (min_channel) {
+        pid = nl_sock_pid(min_channel->sock);
+    }
+
     return pid;
 }
 
@@ -2342,12 +2390,14 @@  dpif_netlink_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops,
 static void
 dpif_netlink_handler_uninit(struct dpif_handler *handler)
 {
+    hmap_destroy(&handler->channels);
     vport_delete_sock_pool(handler);
 }
 
 static int
 dpif_netlink_handler_init(struct dpif_handler *handler)
 {
+    hmap_init(&handler->channels);
     return vport_create_sock_pool(handler);
 }
 #else
@@ -2355,6 +2405,7 @@  dpif_netlink_handler_init(struct dpif_handler *handler)
 static int
 dpif_netlink_handler_init(struct dpif_handler *handler)
 {
+    hmap_init(&handler->channels);
     handler->epoll_fd = epoll_create(10);
     return handler->epoll_fd < 0 ? errno : 0;
 }
@@ -2362,6 +2413,7 @@  dpif_netlink_handler_init(struct dpif_handler *handler)
 static void
 dpif_netlink_handler_uninit(struct dpif_handler *handler)
 {
+    hmap_destroy(&handler->channels);
     close(handler->epoll_fd);
 }
 #endif
@@ -2374,9 +2426,7 @@  static int
 dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
     OVS_REQ_WRLOCK(dpif->upcall_lock)
 {
-    unsigned long int *keep_channels;
     struct dpif_netlink_vport vport;
-    size_t keep_channels_nbits;
     struct nl_dump dump;
     uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
     struct ofpbuf buf;
@@ -2412,22 +2462,16 @@  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
 
     for (i = 0; i < n_handlers; i++) {
         struct dpif_handler *handler = &dpif->handlers[i];
-
         handler->event_offset = handler->n_events = 0;
     }
 
-    keep_channels_nbits = dpif->uc_array_size;
-    keep_channels = bitmap_allocate(keep_channels_nbits);
-
     ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
     dpif_netlink_port_dump_start__(dpif, &dump);
     while (!dpif_netlink_port_dump_next__(dpif, &dump, &vport, &buf)) {
-        uint32_t port_no = odp_to_u32(vport.port_no);
         uint32_t upcall_pid;
         int error;
 
-        if (port_no >= dpif->uc_array_size
-            || !vport_get_pid(dpif, port_no, &upcall_pid)) {
+        if (!vport_get_pid(dpif, vport.port_no, &upcall_pid)) {
             struct nl_sock *sock;
             error = create_nl_sock(dpif, &sock);
 
@@ -2475,25 +2519,15 @@  dpif_netlink_refresh_channels(struct dpif_netlink *dpif, uint32_t n_handlers)
             }
         }
 
-        if (port_no < keep_channels_nbits) {
-            bitmap_set1(keep_channels, port_no);
-        }
         continue;
 
     error:
-        vport_del_channels(dpif, vport.port_no);
+        vport_del_channel(dpif, vport.port_no);
     }
+
     nl_dump_done(&dump);
     ofpbuf_uninit(&buf);
 
-    /* Discard any saved channels that we didn't reuse. */
-    for (i = 0; i < keep_channels_nbits; i++) {
-        if (!bitmap_is_set(keep_channels, i)) {
-            vport_del_channels(dpif, u32_to_odp(i));
-        }
-    }
-    free(keep_channels);
-
     return retval;
 }
 
@@ -2714,6 +2748,8 @@  dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
     OVS_REQ_RDLOCK(dpif->upcall_lock)
 {
     struct dpif_handler *handler;
+    size_t n_channels;
+
     int read_tries = 0;
 
     if (!dpif->handlers || handler_id >= dpif->n_handlers) {
@@ -2721,14 +2757,15 @@  dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
     }
 
     handler = &dpif->handlers[handler_id];
-    if (handler->event_offset >= handler->n_events) {
+    n_channels = dpif_handler_channels_count(handler);
+    if (handler->event_offset >= handler->n_events && n_channels) {
         int retval;
 
         handler->event_offset = handler->n_events = 0;
 
         do {
             retval = epoll_wait(handler->epoll_fd, handler->epoll_events,
-                                dpif->uc_array_size, 0);
+                                n_channels, 0);
         } while (retval < 0 && errno == EINTR);
 
         if (retval < 0) {
@@ -2741,7 +2778,10 @@  dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
 
     while (handler->event_offset < handler->n_events) {
         int idx = handler->epoll_events[handler->event_offset].data.u32;
-        struct dpif_channel *ch = &dpif->channels[idx];
+        struct dpif_channel *ch = dpif_handler_channel_find(handler, idx);
+        if (!ch) {
+            return EAGAIN;
+        }
 
         handler->event_offset++;
 
@@ -2845,12 +2885,14 @@  dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
     if (dpif->handlers) {
         size_t i;
 
-        if (!dpif->channels[0].sock) {
-            return;
-        }
-        for (i = 0; i < dpif->uc_array_size; i++ ) {
+        for (i = 0; i < dpif->n_handlers; i++) {
+            struct dpif_handler *handler = &dpif->handlers[i];
+            struct dpif_channel *channel;
+
+            HMAP_FOR_EACH (channel, hmap_node, &handler->channels) {
+                nl_sock_drain(channel->sock);
+            }
 
-            nl_sock_drain(dpif->channels[i].sock);
         }
     }
 }
diff --git a/lib/id-pool.c b/lib/id-pool.c
index 69910ad08..bef822f6b 100644
--- a/lib/id-pool.c
+++ b/lib/id-pool.c
@@ -93,6 +93,16 @@  id_pool_find(struct id_pool *pool, uint32_t id)
     return NULL;
 }
 
+bool
+id_pool_has_id(struct id_pool *pool, uint32_t id)
+{
+    if (!id_pool_find(pool, id)) {
+        return false;
+    }
+
+    return true;
+}
+
 void
 id_pool_add(struct id_pool *pool, uint32_t id)
 {
diff --git a/lib/id-pool.h b/lib/id-pool.h
index 8721f8793..62876e2a5 100644
--- a/lib/id-pool.h
+++ b/lib/id-pool.h
@@ -29,6 +29,7 @@  void id_pool_destroy(struct id_pool *);
 bool id_pool_alloc_id(struct id_pool *, uint32_t *id);
 void id_pool_free_id(struct id_pool *, uint32_t id);
 void id_pool_add(struct id_pool *, uint32_t id);
+bool id_pool_has_id(struct id_pool *, uint32_t id);
 
 /*
  * ID pool.