
[ovs-dev,RFC] dpif-netlink: distribute polling to discrete handlers

Message ID 20200624134800.2905038-1-aconole@redhat.com
State New

Commit Message

Aaron Conole June 24, 2020, 1:48 p.m. UTC
Currently, the channel handlers are polled globally.  On some
systems, this causes a thundering herd issue where multiple
handler threads become active, only to do no work and immediately
sleep.

The approach here is to push the netlink socket channels to discrete
handler threads to process, rather than polling on every thread.
This will eliminate the need to wake multiple threads.

To check:

  ip netns add left
  ip netns add right
  ip link add center-left type veth peer name left0
  ip link add center-right type veth peer name right0
  ip link set left0 netns left
  ip link set right0 netns right
  ip link set center-left up
  ip link set center-right up
  ip netns exec left ip link set left0 up
  ip netns exec left ip addr add 172.31.110.10/24 dev left0
  ip netns exec right ip link set right0 up
  ip netns exec right ip addr add 172.31.110.11/24 dev right0

  ovs-vsctl add-br br0
  ovs-vsctl add-port br0 center-right
  ovs-vsctl add-port br0 center-left

  # in one terminal
  perf record -e sched:sched_wakeup,irq:softirq_entry -ag

  # in a separate terminal
  ip netns exec left arping -I left0 -c 1 172.31.110.11

  # in the perf terminal after exiting
  perf script

Look for the number of 'handler' threads which were made active.

Reported-by: David Ahern <dsahern@gmail.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
Cc: Matteo Croce <technoboy85@gmail.com>
Cc: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
---
I haven't finished all my testing, but I wanted to send this
for early feedback in case I went completely off course.

 lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++-------------------
 1 file changed, 112 insertions(+), 78 deletions(-)
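
As a rough illustration of the last step of the test procedure, the
`perf script` output can be filtered to count handler-thread wakeups.
The sample lines below are made up for illustration; the real format
comes from `perf script` after the recording above:

```shell
# Hypothetical perf script lines; real ones come from `perf script`.
printf '%s\n' \
  'handler12  345 [001] 123.4: sched:sched_wakeup: ...' \
  'handler13  346 [002] 123.4: sched:sched_wakeup: ...' \
  'revalidator7 400 [003] 123.5: sched:sched_wakeup: ...' \
  | grep -c 'handler[0-9]'
```

With the thundering herd, the count approaches the number of handler
threads; with the channels distributed, only one handler should wake
per upcall.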

Comments

Matteo Croce June 26, 2020, 12:07 a.m. UTC | #1
On Wed, Jun 24, 2020 at 3:48 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Currently, the channel handlers are polled globally.  On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
>
> The approach here is to push the netlink socket channels to discrete
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
>
> To check:
>
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0
>   ip link add center-right type veth peer name right0
>   ip link set left0 netns left
>   ip link set right0 netns right
>   ip link set center-left up
>   ip link set center-right up
>   ip netns exec left ip link set left0 up
>   ip netns exec left ip addr add 172.31.110.10/24 dev left0
>   ip netns exec right ip link set right0 up
>   ip netns exec right ip addr add 172.31.110.11/24 dev right0
>
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 center-right
>   ovs-vsctl add-port br0 center-left
>
>   # in one terminal
>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>
>   # in a separate terminal
>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>
>   # in the perf terminal after exiting
>   perf script
>
> Look for the number of 'handler' threads which were made active.
>
> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
>

Great work!

Just to understand, this reverts some logic of my patch, and
transposes the threads and sockets?

Maybe it's the case to add a Fixes tag pointing to 69c51582f ?

Nitpick: `ip -n` instead of `ip netns exec` where possible, e.g.

ip netns exec left ip link set left0 up

becomes

ip -n left link set left0 up

Cheers,
Aaron Conole June 26, 2020, 1:53 p.m. UTC | #2
Matteo Croce <technoboy85@gmail.com> writes:

> On Wed, Jun 24, 2020 at 3:48 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> Currently, the channel handlers are polled globally.  On some
>> systems, this causes a thundering herd issue where multiple
>> handler threads become active, only to do no work and immediately
>> sleep.
>>
>> The approach here is to push the netlink socket channels to discrete
>> handler threads to process, rather than polling on every thread.
>> This will eliminate the need to wake multiple threads.
>>
>> To check:
>>
>>   ip netns add left
>>   ip netns add right
>>   ip link add center-left type veth peer name left0
>>   ip link add center-right type veth peer name right0
>>   ip link set left0 netns left
>>   ip link set right0 netns right
>>   ip link set center-left up
>>   ip link set center-right up
>>   ip netns exec left ip link set left0 up
>>   ip netns exec left ip addr add 172.31.110.10/24 dev left0
>>   ip netns exec right ip link set right0 up
>>   ip netns exec right ip addr add 172.31.110.11/24 dev right0
>>
>>   ovs-vsctl add-br br0
>>   ovs-vsctl add-port br0 center-right
>>   ovs-vsctl add-port br0 center-left
>>
>>   # in one terminal
>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>>
>>   # in a separate terminal
>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>>
>>   # in the perf terminal after exiting
>>   perf script
>>
>> Look for the number of 'handler' threads which were made active.
>>
>> Reported-by: David Ahern <dsahern@gmail.com>
>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
>> Cc: Matteo Croce <technoboy85@gmail.com>
>> Cc: Flavio Leitner <fbl@sysclose.org>
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>
>
> Great work!
>
> Just to understand, this reverts some logic of my patch, and
> transposes the threads and sockets?

Yes.

> Maybe it's the case to add a Fixes tag pointing to 69c51582f ?

Okay, I can include it when I post with removing the RFC tag.

> Nitpick: `ip -n` instead of `ip netns exec` where possible, e.g.
>
> ip netns exec left ip link set left0 up
>
> becomes
>
> ip -n left link set left0 up

Neat!

> Cheers,
Flavio Leitner June 30, 2020, 8:37 p.m. UTC | #3
On Wed, Jun 24, 2020 at 09:48:00AM -0400, Aaron Conole wrote:
> Currently, the channel handlers are polled globally.  On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
> 
> The approach here is to push the netlink socket channels to discrete
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
> 
> To check:
> 
>   ip netns add left
>   ip netns add right
>   ip link add center-left type veth peer name left0
>   ip link add center-right type veth peer name right0
>   ip link set left0 netns left
>   ip link set right0 netns right
>   ip link set center-left up
>   ip link set center-right up
>   ip netns exec left ip link set left0 up
>   ip netns exec left ip addr add 172.31.110.10/24 dev left0
>   ip netns exec right ip link set right0 up
>   ip netns exec right ip addr add 172.31.110.11/24 dev right0
> 
>   ovs-vsctl add-br br0
>   ovs-vsctl add-port br0 center-right
>   ovs-vsctl add-port br0 center-left
> 
>   # in one terminal
>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> 
>   # in a separate terminal
>   ip netns exec left arping -I left0 -c 1 172.31.110.11
> 
>   # in the perf terminal after exiting
>   perf script
> 
> Look for the number of 'handler' threads which were made active.

I am surprised to find out that in 2020 Linux still doesn't provide
a reasonable solution for the thundering herd issue. I wonder how
many applications out there have this issue. It looks like the
devil is in the details.

Anyways, I think the previous model didn't work because the number
of attached ports is increasing and so is the number of cores. The
number of sockets escalates quickly and it had the packet re-ordering
issue.

To move forward, we switched to one socket per port with all threads
polling the same socket, in the hope that the kernel would wake only
one or very few threads. It looks like that doesn't work, and the
packet re-ordering is still an issue.

Therefore this approach, where there is one socket per port but only
one thread polling it, would solve both the thundering herd issue and
the packet re-ordering issue.

One concern is with load capacity, since now only one thread would
be responsible for processing a port, but I think that's what we had
initially, correct? So, it's not an issue.

It looks like it now allocates enough memory for all port_idx entries
per thread even though it will not use all of them. Perhaps that
can be fixed? For instance, instead of indexing the array with
port_idx, use handler->port_idx as an identifier?

More inline

> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
> I haven't finished all my testing, but I wanted to send this
> for early feedback in case I went completely off course.
> 
>  lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++-------------------
>  1 file changed, 112 insertions(+), 78 deletions(-)
> 
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 1817e9f849..d59375cf00 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -179,9 +179,11 @@ struct dpif_windows_vport_sock {
>  
>  struct dpif_handler {
>      struct epoll_event *epoll_events;
> -    int epoll_fd;                 /* epoll fd that includes channel socks. */
> -    int n_events;                 /* Num events returned by epoll_wait(). */
> -    int event_offset;             /* Offset into 'epoll_events'. */
> +    struct dpif_channel *channels; /* Array of channels for each port. */
> +    size_t n_channels;             /* Count of channels for each port. */
> +    int epoll_fd;                  /* epoll fd that includes channel socks. */
> +    int n_events;                  /* Num events returned by epoll_wait(). */
> +    int event_offset;              /* Offset into 'epoll_events'. */
>  
>  #ifdef _WIN32
>      /* Pool of sockets. */
> @@ -201,7 +203,6 @@ struct dpif_netlink {
>      struct fat_rwlock upcall_lock;
>      struct dpif_handler *handlers;
>      uint32_t n_handlers;           /* Num of upcall handlers. */
> -    struct dpif_channel *channels; /* Array of channels for each port. */
>      int uc_array_size;             /* Size of 'handler->channels' and */
>                                     /* 'handler->epoll_events'. */
>  
> @@ -452,15 +453,22 @@ static bool
>  vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
>                uint32_t *upcall_pid)
>  {
> -    /* Since the nl_sock can only be assigned in either all
> -     * or none "dpif" channels, the following check
> -     * would suffice. */
> -    if (!dpif->channels[port_idx].sock) {
> -        return false;
> -    }
> +    size_t i;
>      ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>  
> -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
> +    /* The 'port_idx' should only be valid for a single handler. */
> +    for (i = 0; i < dpif->n_handlers; ++i) {
> +
> +        if (port_idx < dpif->handlers[i].n_channels &&
> +            dpif->handlers[i].channels[port_idx].sock) {
> +            *upcall_pid =
> +                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
> +            break;
> +        }
> +    }


Perhaps

     /* The 'port_idx' should only be valid for a single handler. */
-    for (i = 0; i < dpif->n_handlers; ++i) {
+    for (i = 0; i < dpif->n_handlers; i++) {
+        struct dpif_handler *handler = &dpif->handlers[i];
 
-        if (port_idx < dpif->handlers[i].n_channels &&
-            dpif->handlers[i].channels[port_idx].sock) {
+        if (port_idx < handler->n_channels &&
+            handler->channels[port_idx].sock) {
             *upcall_pid =
-                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
+                nl_sock_pid(handler->channels[port_idx].sock);
             break;
         }
     }

the same for similar places below.


> +
> +    if (i == dpif->n_handlers)
> +        return false;
>  
>      return true;
>  }
> @@ -473,15 +481,23 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
>      uint32_t port_idx = odp_to_u32(port_no);
>      size_t i;
>      int error;
> +    struct dpif_handler *handler = NULL;
>  
>      if (dpif->handlers == NULL) {
>          close_nl_sock(sock);
>          return 0;
>      }
>  
> +    /* choose a handler by finding the least populated handler. */
> +    handler = &dpif->handlers[0];
> +    for (i = 0; i < dpif->n_handlers; ++i) {
> +        if (dpif->handlers[i].n_channels < handler->n_channels)
> +            handler = &dpif->handlers[i];
> +    }
> +
>      /* We assume that the datapath densely chooses port numbers, which can
>       * therefore be used as an index into 'channels' and 'epoll_events' of
> -     * 'dpif'. */
> +     * 'dpif->handler'. */
>      if (port_idx >= dpif->uc_array_size) {
>          uint32_t new_size = port_idx + 1;
>  
> @@ -491,40 +507,33 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
>              return EFBIG;
>          }
>  
> -        dpif->channels = xrealloc(dpif->channels,
> -                                  new_size * sizeof *dpif->channels);
> +        handler->channels = xrealloc(handler->channels,
> +                                     new_size * sizeof *handler->channels);
>  
> -        for (i = dpif->uc_array_size; i < new_size; i++) {
> -            dpif->channels[i].sock = NULL;
> +        for (i = handler->n_channels; i < new_size; i++) {
> +            handler->channels[i].sock = NULL;
>          }
>  
> -        for (i = 0; i < dpif->n_handlers; i++) {
> -            struct dpif_handler *handler = &dpif->handlers[i];
> -
> -            handler->epoll_events = xrealloc(handler->epoll_events,
> -                new_size * sizeof *handler->epoll_events);
> -
> -        }
> +        handler->epoll_events = xrealloc(handler->epoll_events,
> +                                         new_size * sizeof *handler->epoll_events);
>          dpif->uc_array_size = new_size;
> +        handler->n_channels = new_size;
>      }
>  
>      memset(&event, 0, sizeof event);
>      event.events = EPOLLIN | EPOLLEXCLUSIVE;
>      event.data.u32 = port_idx;
>  
> -    for (i = 0; i < dpif->n_handlers; i++) {
> -        struct dpif_handler *handler = &dpif->handlers[i];
> -
>  #ifndef _WIN32
> -        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> -                      &event) < 0) {
> -            error = errno;
> -            goto error;
> -        }
> -#endif
> +    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
> +                  &event) < 0) {
> +        error = errno;
> +        goto error;
>      }
> -    dpif->channels[port_idx].sock = sock;
> -    dpif->channels[port_idx].last_poll = LLONG_MIN;
> +#endif
> +
> +    handler->channels[port_idx].sock = sock;
> +    handler->channels[port_idx].last_poll = LLONG_MIN;
>  
>      return 0;
>  
> @@ -535,34 +544,53 @@ error:
>                    nl_sock_fd(sock), NULL);
>      }
>  #endif
> -    dpif->channels[port_idx].sock = NULL;
> +    handler->channels[port_idx].sock = NULL;
>  
>      return error;
>  }
>  
> +static void
> +vport_del_channel(struct dpif_handler *handler, uint32_t port_idx)
> +{
> +    if (port_idx < handler->n_channels &&
> +        handler->channels[port_idx].sock) {
> +#ifndef _WIN32
> +        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> +                  nl_sock_fd(handler->channels[port_idx].sock), NULL);
> +#endif
> +        handler->event_offset = handler->n_events = 0;
> +
> +#ifndef _WIN32
> +        nl_sock_destroy(handler->channels[port_idx].sock);
> +#endif
> +        handler->channels[port_idx].sock = NULL;
> +    }        
> +}
> +
>  static void
>  vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
>  {
>      uint32_t port_idx = odp_to_u32(port_no);
> +    struct dpif_handler *handler = NULL;
>      size_t i;
>  
> -    if (!dpif->handlers || port_idx >= dpif->uc_array_size
> -        || !dpif->channels[port_idx].sock) {
> +    if (!dpif->handlers || port_idx >= dpif->uc_array_size) {
>          return;
>      }
>  
>      for (i = 0; i < dpif->n_handlers; i++) {
> -        struct dpif_handler *handler = &dpif->handlers[i];
> -#ifndef _WIN32
> -        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
> -                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
> -#endif
> -        handler->event_offset = handler->n_events = 0;
> +        handler = &dpif->handlers[i];
> +
> +        if (port_idx >= handler->n_channels ||
> +            !handler->channels[port_idx].sock) {
> +            handler = NULL;
> +            continue;
> +        }
> +    }
> +
> +    if (handler) {
> +        vport_del_channel(handler, port_idx);
>      }
> -#ifndef _WIN32
> -    nl_sock_destroy(dpif->channels[port_idx].sock);
> -#endif
> -    dpif->channels[port_idx].sock = NULL;
>  }
>  
>  static void
> @@ -575,36 +603,38 @@ destroy_all_channels(struct dpif_netlink *dpif)
>          return;
>      }
>  
> -    for (i = 0; i < dpif->uc_array_size; i++ ) {
> -        struct dpif_netlink_vport vport_request;
> -        uint32_t upcall_pids = 0;
> -
> -        if (!dpif->channels[i].sock) {
> -            continue;
> -        }
> +    for (i = 0; i < dpif->n_handlers; i++) {
> +        struct dpif_handler *handler = &dpif->handlers[i];
> +        size_t j;
>  
> -        /* Turn off upcalls. */
> -        dpif_netlink_vport_init(&vport_request);
> -        vport_request.cmd = OVS_VPORT_CMD_SET;
> -        vport_request.dp_ifindex = dpif->dp_ifindex;
> -        vport_request.port_no = u32_to_odp(i);
> -        vport_request.n_upcall_pids = 1;
> -        vport_request.upcall_pids = &upcall_pids;
> -        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
> +        for (j = 0; j < handler->n_channels; j++ ) {
> +            struct dpif_netlink_vport vport_request;
> +            uint32_t upcall_pids = 0;
>  
> -        vport_del_channels(dpif, u32_to_odp(i));
> -    }
> +            if (!handler->channels[j].sock) {
> +                continue;
> +            }
>  
> -    for (i = 0; i < dpif->n_handlers; i++) {
> -        struct dpif_handler *handler = &dpif->handlers[i];
> +            /* Turn off upcalls. */
> +            dpif_netlink_vport_init(&vport_request);
> +            vport_request.cmd = OVS_VPORT_CMD_SET;
> +            vport_request.dp_ifindex = dpif->dp_ifindex;
> +            vport_request.port_no = u32_to_odp(j);
> +            vport_request.n_upcall_pids = 1;
> +            vport_request.upcall_pids = &upcall_pids;
> +            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
>  
> +            vport_del_channel(handler, j);
> +        }
> +        handler->n_channels = 0;
> +        free(handler->channels);
> +        handler->channels = NULL;
>          dpif_netlink_handler_uninit(handler);
>          free(handler->epoll_events);
> +        handler->epoll_events = NULL;
>      }
> -    free(dpif->channels);
>      free(dpif->handlers);
>      dpif->handlers = NULL;
> -    dpif->channels = NULL;
>      dpif->n_handlers = 0;
>      dpif->uc_array_size = 0;
>  }
> @@ -1091,13 +1121,17 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
>          /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
>           * channel, since it is not heavily loaded. */
>          uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
> +        size_t i;
>  
>          /* Needs to check in case the socket pointer is changed in between
>           * the holding of upcall_lock.  A known case happens when the main
>           * thread deletes the vport while the handler thread is handling
>           * the upcall from that port. */
> -        if (dpif->channels[idx].sock) {
> -            pid = nl_sock_pid(dpif->channels[idx].sock);
> +        for (i = 0; i < dpif->n_handlers; ++i) {
> +            if (idx < dpif->handlers[i].n_channels &&
> +                dpif->handlers[i].channels[idx].sock) {
> +                pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock);
> +            }
>          }
>      }
>  
> @@ -2738,7 +2772,7 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
>  
>      while (handler->event_offset < handler->n_events) {
>          int idx = handler->epoll_events[handler->event_offset].data.u32;
> -        struct dpif_channel *ch = &dpif->channels[idx];
> +        struct dpif_channel *ch = &handler->channels[idx];
>  
>          handler->event_offset++;
>  
> @@ -2840,14 +2874,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
>      OVS_REQ_WRLOCK(dpif->upcall_lock)
>  {
>      if (dpif->handlers) {
> -        size_t i;
> +        size_t i, j;
>  
> -        if (!dpif->channels[0].sock) {
> -            return;
> -        }
> -        for (i = 0; i < dpif->uc_array_size; i++ ) {
> -
> -            nl_sock_drain(dpif->channels[i].sock);
> +        for (j = 0; j < dpif->n_handlers; ++j) {
> +            for (i = 0; i < dpif->handlers[j].n_channels; i++ ) {
> +                if (dpif->handlers[j].channels[i].sock) {
> +                    nl_sock_drain(dpif->handlers[j].channels[i].sock);
> +                }
> +            }
>          }
>      }
>  }
> -- 
> 2.25.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Aaron Conole July 6, 2020, 3:01 p.m. UTC | #4
Flavio Leitner <fbl@sysclose.org> writes:

> On Wed, Jun 24, 2020 at 09:48:00AM -0400, Aaron Conole wrote:
>> Currently, the channel handlers are polled globally.  On some
>> systems, this causes a thundering herd issue where multiple
>> handler threads become active, only to do no work and immediately
>> sleep.
>> 
>> The approach here is to push the netlink socket channels to discrete
>> handler threads to process, rather than polling on every thread.
>> This will eliminate the need to wake multiple threads.
>> 
>> To check:
>> 
>>   ip netns add left
>>   ip netns add right
>>   ip link add center-left type veth peer name left0
>>   ip link add center-right type veth peer name right0
>>   ip link set left0 netns left
>>   ip link set right0 netns right
>>   ip link set center-left up
>>   ip link set center-right up
>>   ip netns exec left ip link set left0 up
>>   ip netns exec left ip addr add 172.31.110.10/24 dev left0
>>   ip netns exec right ip link set right0 up
>>   ip netns exec right ip addr add 172.31.110.11/24 dev right0
>> 
>>   ovs-vsctl add-br br0
>>   ovs-vsctl add-port br0 center-right
>>   ovs-vsctl add-port br0 center-left
>> 
>>   # in one terminal
>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>> 
>>   # in a separate terminal
>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
>> 
>>   # in the perf terminal after exiting
>>   perf script
>> 
>> Look for the number of 'handler' threads which were made active.
>
> I am surprised to find out that in 2020 Linux still doesn't provide
> a reasonable solution for the thundering herd issue. I wonder how
> many applications out there have this issue. It looks like the
> devil is in the details.

It turns out that it's complicated.  And the way OVS is structured,
simply switching the poll loop to epoll() isn't straightforward -
epoll() would require a completely new design for the way FDs are
polled.  And then it wouldn't fit with other systems (like Windows or
*BSD), and we'd need to implement things quite differently.  It's very
intrusive and painful to get right.

> Anyways, I think the previous model didn't work because the number
> of attached ports is increasing and so is the number of cores. The
> number of sockets escalates quickly and it had the packet re-ordering
> issue.

Yes - that's true.  We would do n_ports * n_cores sockets.  Now we just
do n_ports sockets.

> To move forward, we switched to one socket per port with all threads
> polling the same socket, in the hope that the kernel would wake only
> one or very few threads. It looks like that doesn't work, and the
> packet re-ordering is still an issue.

Yes - because what happens is that the first-level epoll() may not
trigger, so the threads go into the deep poll_wait(), which falls back
to the poll() system call (which uses the older semantics and
introduces the thundering herd - see the discussion linked).  Then when
a packet comes in, the kernel will start waking all the threads.

> Therefore this approach, where there is one socket per port but only
> one thread polling it, would solve both the thundering herd issue and
> the packet re-ordering issue.
>
> One concern is with load capacity, since now only one thread would
> be responsible for processing a port, but I think that's what we had
> initially, correct? So, it's not an issue.

Yes.  Actually, I think it might be better this way - if every packet
for a port is getting slowpathed, we will keep all the port processing
on the same core, so no more out-of-order delivery (which may have
performance impacts on TCP), and since it's the slowpath, performance
won't really matter that much anyway.

> It looks like it now allocates enough memory for all port_idx entries
> per thread even though it will not use all of them. Perhaps that
> can be fixed? For instance, instead of indexing the array with
> port_idx, use handler->port_idx as an identifier?

Yes, that's why I posted as RFC.  I plan to clean up the API / structure
a little bit this week after finishing going through my email from PTO.

> More inline
>
>> Reported-by: David Ahern <dsahern@gmail.com>
>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
>> Cc: Matteo Croce <technoboy85@gmail.com>
>> Cc: Flavio Leitner <fbl@sysclose.org>
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> ---
>> I haven't finished all my testing, but I wanted to send this
>> for early feedback in case I went completely off course.
>> 
>>  lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++-------------------
>>  1 file changed, 112 insertions(+), 78 deletions(-)
>> 
>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>> index 1817e9f849..d59375cf00 100644
>> --- a/lib/dpif-netlink.c
>> +++ b/lib/dpif-netlink.c
>> @@ -179,9 +179,11 @@ struct dpif_windows_vport_sock {
>>  
>>  struct dpif_handler {
>>      struct epoll_event *epoll_events;
>> -    int epoll_fd;                 /* epoll fd that includes channel socks. */
>> -    int n_events;                 /* Num events returned by epoll_wait(). */
>> -    int event_offset;             /* Offset into 'epoll_events'. */
>> +    struct dpif_channel *channels; /* Array of channels for each port. */
>> +    size_t n_channels;             /* Count of channels for each port. */
>> +    int epoll_fd;                  /* epoll fd that includes channel socks. */
>> +    int n_events;                  /* Num events returned by epoll_wait(). */
>> +    int event_offset;              /* Offset into 'epoll_events'. */
>>  
>>  #ifdef _WIN32
>>      /* Pool of sockets. */
>> @@ -201,7 +203,6 @@ struct dpif_netlink {
>>      struct fat_rwlock upcall_lock;
>>      struct dpif_handler *handlers;
>>      uint32_t n_handlers;           /* Num of upcall handlers. */
>> -    struct dpif_channel *channels; /* Array of channels for each port. */
>>      int uc_array_size;             /* Size of 'handler->channels' and */
>>                                     /* 'handler->epoll_events'. */
>>  
>> @@ -452,15 +453,22 @@ static bool
>>  vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
>>                uint32_t *upcall_pid)
>>  {
>> -    /* Since the nl_sock can only be assigned in either all
>> -     * or none "dpif" channels, the following check
>> -     * would suffice. */
>> -    if (!dpif->channels[port_idx].sock) {
>> -        return false;
>> -    }
>> +    size_t i;
>>      ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
>>  
>> -    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
>> +    /* The 'port_idx' should only be valid for a single handler. */
>> +    for (i = 0; i < dpif->n_handlers; ++i) {
>> +
>> +        if (port_idx < dpif->handlers[i].n_channels &&
>> +            dpif->handlers[i].channels[port_idx].sock) {
>> +            *upcall_pid =
>> +                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
>> +            break;
>> +        }
>> +    }
>
>
> Perhaps
>
>      /* The 'port_idx' should only be valid for a single handler. */
> -    for (i = 0; i < dpif->n_handlers; ++i) {
> +    for (i = 0; i < dpif->n_handlers; i++) {

You don't like pre-increment operator?  It's an old habit from writing lots
of c++ :-)

> +        struct dpif_handler *handler = &dpif->handlers[i];
>  
> -        if (port_idx < dpif->handlers[i].n_channels &&
> -            dpif->handlers[i].channels[port_idx].sock) {
> +        if (port_idx < handler->n_channels &&
> +            handler->channels[port_idx].sock) {
>              *upcall_pid =
> -                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
> +                nl_sock_pid(handler->channels[port_idx].sock);
>              break;
>          }
>      }
>
> the same for similar places below.

I plan on replacing all the open-coded loops with a new FOR_EACH

Thanks for the review, Flavio!

>
>> +
>> +    if (i == dpif->n_handlers)
>> +        return false;
>>  
>>      return true;
>>  }
Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 1817e9f849..d59375cf00 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -179,9 +179,11 @@  struct dpif_windows_vport_sock {
 
 struct dpif_handler {
     struct epoll_event *epoll_events;
-    int epoll_fd;                 /* epoll fd that includes channel socks. */
-    int n_events;                 /* Num events returned by epoll_wait(). */
-    int event_offset;             /* Offset into 'epoll_events'. */
+    struct dpif_channel *channels; /* Array of channels for each port. */
+    size_t n_channels;             /* Number of channels in 'channels'. */
+    int epoll_fd;                  /* epoll fd that includes channel socks. */
+    int n_events;                  /* Num events returned by epoll_wait(). */
+    int event_offset;              /* Offset into 'epoll_events'. */
 
 #ifdef _WIN32
     /* Pool of sockets. */
@@ -201,7 +203,6 @@  struct dpif_netlink {
     struct fat_rwlock upcall_lock;
     struct dpif_handler *handlers;
     uint32_t n_handlers;           /* Num of upcall handlers. */
-    struct dpif_channel *channels; /* Array of channels for each port. */
     int uc_array_size;             /* Size of 'handler->channels' and */
                                    /* 'handler->epoll_events'. */
 
@@ -452,15 +453,22 @@  static bool
 vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
               uint32_t *upcall_pid)
 {
-    /* Since the nl_sock can only be assigned in either all
-     * or none "dpif" channels, the following check
-     * would suffice. */
-    if (!dpif->channels[port_idx].sock) {
-        return false;
-    }
+    size_t i;
     ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
 
-    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
+    /* The 'port_idx' should only be valid for a single handler. */
+    for (i = 0; i < dpif->n_handlers; ++i) {
+
+        if (port_idx < dpif->handlers[i].n_channels &&
+            dpif->handlers[i].channels[port_idx].sock) {
+            *upcall_pid =
+                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
+            break;
+        }
+    }
+
+    if (i == dpif->n_handlers)
+        return false;
 
     return true;
 }
@@ -473,15 +481,23 @@  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
     uint32_t port_idx = odp_to_u32(port_no);
     size_t i;
     int error;
+    struct dpif_handler *handler = NULL;
 
     if (dpif->handlers == NULL) {
         close_nl_sock(sock);
         return 0;
     }
 
+    /* Choose the least-populated handler. */
+    handler = &dpif->handlers[0];
+    for (i = 0; i < dpif->n_handlers; ++i) {
+        if (dpif->handlers[i].n_channels < handler->n_channels)
+            handler = &dpif->handlers[i];
+    }
+
     /* We assume that the datapath densely chooses port numbers, which can
      * therefore be used as an index into 'channels' and 'epoll_events' of
-     * 'dpif'. */
+     * 'dpif->handler'. */
     if (port_idx >= dpif->uc_array_size) {
         uint32_t new_size = port_idx + 1;
 
@@ -491,40 +507,33 @@  vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
             return EFBIG;
         }
 
-        dpif->channels = xrealloc(dpif->channels,
-                                  new_size * sizeof *dpif->channels);
+        handler->channels = xrealloc(handler->channels,
+                                     new_size * sizeof *handler->channels);
 
-        for (i = dpif->uc_array_size; i < new_size; i++) {
-            dpif->channels[i].sock = NULL;
+        for (i = handler->n_channels; i < new_size; i++) {
+            handler->channels[i].sock = NULL;
         }
 
-        for (i = 0; i < dpif->n_handlers; i++) {
-            struct dpif_handler *handler = &dpif->handlers[i];
-
-            handler->epoll_events = xrealloc(handler->epoll_events,
-                new_size * sizeof *handler->epoll_events);
-
-        }
+        handler->epoll_events = xrealloc(handler->epoll_events,
+                                         new_size * sizeof *handler->epoll_events);
         dpif->uc_array_size = new_size;
+        handler->n_channels = new_size;
     }
 
     memset(&event, 0, sizeof event);
     event.events = EPOLLIN | EPOLLEXCLUSIVE;
     event.data.u32 = port_idx;
 
-    for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-
 #ifndef _WIN32
-        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
-                      &event) < 0) {
-            error = errno;
-            goto error;
-        }
-#endif
+    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
+                  &event) < 0) {
+        error = errno;
+        goto error;
     }
-    dpif->channels[port_idx].sock = sock;
-    dpif->channels[port_idx].last_poll = LLONG_MIN;
+#endif
+
+    handler->channels[port_idx].sock = sock;
+    handler->channels[port_idx].last_poll = LLONG_MIN;
 
     return 0;
 
@@ -535,34 +544,53 @@  error:
                   nl_sock_fd(sock), NULL);
     }
 #endif
-    dpif->channels[port_idx].sock = NULL;
+    handler->channels[port_idx].sock = NULL;
 
     return error;
 }
 
+static void
+vport_del_channel(struct dpif_handler *handler, uint32_t port_idx)
+{
+    if (port_idx < handler->n_channels &&
+        handler->channels[port_idx].sock) {
+#ifndef _WIN32
+        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
+                  nl_sock_fd(handler->channels[port_idx].sock), NULL);
+#endif
+        handler->event_offset = handler->n_events = 0;
+
+#ifndef _WIN32
+        nl_sock_destroy(handler->channels[port_idx].sock);
+#endif
+        handler->channels[port_idx].sock = NULL;
+    }
+}
+
 static void
 vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
 {
     uint32_t port_idx = odp_to_u32(port_no);
+    struct dpif_handler *handler = NULL;
     size_t i;
 
-    if (!dpif->handlers || port_idx >= dpif->uc_array_size
-        || !dpif->channels[port_idx].sock) {
+    if (!dpif->handlers || port_idx >= dpif->uc_array_size) {
         return;
     }
 
     for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-#ifndef _WIN32
-        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
-                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
-#endif
-        handler->event_offset = handler->n_events = 0;
+        handler = &dpif->handlers[i];
+
+        if (port_idx >= handler->n_channels ||
+            !handler->channels[port_idx].sock) {
+            handler = NULL;
+            continue;
+        }
+    }
+
+    if (handler) {
+        vport_del_channel(handler, port_idx);
     }
-#ifndef _WIN32
-    nl_sock_destroy(dpif->channels[port_idx].sock);
-#endif
-    dpif->channels[port_idx].sock = NULL;
 }
 
 static void
@@ -575,36 +603,38 @@  destroy_all_channels(struct dpif_netlink *dpif)
         return;
     }
 
-    for (i = 0; i < dpif->uc_array_size; i++ ) {
-        struct dpif_netlink_vport vport_request;
-        uint32_t upcall_pids = 0;
-
-        if (!dpif->channels[i].sock) {
-            continue;
-        }
+    for (i = 0; i < dpif->n_handlers; i++) {
+        struct dpif_handler *handler = &dpif->handlers[i];
+        size_t j;
 
-        /* Turn off upcalls. */
-        dpif_netlink_vport_init(&vport_request);
-        vport_request.cmd = OVS_VPORT_CMD_SET;
-        vport_request.dp_ifindex = dpif->dp_ifindex;
-        vport_request.port_no = u32_to_odp(i);
-        vport_request.n_upcall_pids = 1;
-        vport_request.upcall_pids = &upcall_pids;
-        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
+        for (j = 0; j < handler->n_channels; j++ ) {
+            struct dpif_netlink_vport vport_request;
+            uint32_t upcall_pids = 0;
 
-        vport_del_channels(dpif, u32_to_odp(i));
-    }
+            if (!handler->channels[j].sock) {
+                continue;
+            }
 
-    for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
+            /* Turn off upcalls. */
+            dpif_netlink_vport_init(&vport_request);
+            vport_request.cmd = OVS_VPORT_CMD_SET;
+            vport_request.dp_ifindex = dpif->dp_ifindex;
+            vport_request.port_no = u32_to_odp(j);
+            vport_request.n_upcall_pids = 1;
+            vport_request.upcall_pids = &upcall_pids;
+            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
 
+            vport_del_channel(handler, j);
+        }
+        handler->n_channels = 0;
+        free(handler->channels);
+        handler->channels = NULL;
         dpif_netlink_handler_uninit(handler);
         free(handler->epoll_events);
+        handler->epoll_events = NULL;
     }
-    free(dpif->channels);
     free(dpif->handlers);
     dpif->handlers = NULL;
-    dpif->channels = NULL;
     dpif->n_handlers = 0;
     dpif->uc_array_size = 0;
 }
@@ -1091,13 +1121,17 @@  dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
         /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
          * channel, since it is not heavily loaded. */
         uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
+        size_t i;
 
         /* Needs to check in case the socket pointer is changed in between
          * the holding of upcall_lock.  A known case happens when the main
          * thread deletes the vport while the handler thread is handling
          * the upcall from that port. */
-        if (dpif->channels[idx].sock) {
-            pid = nl_sock_pid(dpif->channels[idx].sock);
+        for (i = 0; i < dpif->n_handlers; ++i) {
+            if (idx < dpif->handlers[i].n_channels &&
+                dpif->handlers[i].channels[idx].sock) {
+                pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock);
+            }
         }
     }
 
@@ -2738,7 +2772,7 @@  dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
 
     while (handler->event_offset < handler->n_events) {
         int idx = handler->epoll_events[handler->event_offset].data.u32;
-        struct dpif_channel *ch = &dpif->channels[idx];
+        struct dpif_channel *ch = &handler->channels[idx];
 
         handler->event_offset++;
 
@@ -2840,14 +2874,14 @@  dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
     OVS_REQ_WRLOCK(dpif->upcall_lock)
 {
     if (dpif->handlers) {
-        size_t i;
+        size_t i, j;
 
-        if (!dpif->channels[0].sock) {
-            return;
-        }
-        for (i = 0; i < dpif->uc_array_size; i++ ) {
-
-            nl_sock_drain(dpif->channels[i].sock);
+        for (j = 0; j < dpif->n_handlers; ++j) {
+            for (i = 0; i < dpif->handlers[j].n_channels; i++ ) {
+                if (dpif->handlers[j].channels[i].sock) {
+                    nl_sock_drain(dpif->handlers[j].channels[i].sock);
+                }
+            }
         }
     }
 }