Message ID | 20200624134800.2905038-1-aconole@redhat.com
---|---
State | Changes Requested
Series | [ovs-dev,RFC] dpif-netlink: distribute polling to discreet handlers
On Wed, Jun 24, 2020 at 3:48 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Currently, the channel handlers are polled globally. On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
>
> The approach here is to push the netlink socket channels to discreet
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
>
> To check:
>
> ip netns add left
> ip netns add right
> ip link add center-left type veth peer name left0
> ip link add center-right type veth peer name right0
> ip link set left0 netns left
> ip link set right0 netns right
> ip link set center-left up
> ip link set center-right up
> ip netns exec left ip link set left0 up
> ip netns exec left ip addr add 172.31.110.10/24 dev left0
> ip netns exec right ip link set right0 up
> ip netns exec right ip addr add 172.31.110.11/24 dev right0
>
> ovs-vsctl add-br br0
> ovs-vsctl add-port br0 center-right
> ovs-vsctl add-port br0 center-left
>
> # in one terminal
> perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>
> # in a separate terminal
> ip netns exec left arping -I left0 -c 1 172.31.110.11
>
> # in the perf terminal after exiting
> perf script
>
> Look for the number of 'handler' threads which were made active.
>
> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Signed-off-by: Aaron Conole <aconole@redhat.com>

Great work!

Just to understand, this reverts some of the logic of my patch and transposes the threads and sockets?

Maybe it would be worth adding a Fixes tag pointing to 69c51582f?

Nitpick: use `ip -n` instead of `ip netns exec` where possible, e.g.

    ip netns exec left ip link set left0 up

becomes

    ip -n left link set left0 up

Cheers,
Matteo Croce <technoboy85@gmail.com> writes:

> On Wed, Jun 24, 2020 at 3:48 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> Currently, the channel handlers are polled globally. On some
>> systems, this causes a thundering herd issue where multiple
>> handler threads become active, only to do no work and immediately
>> sleep.
>>
>> The approach here is to push the netlink socket channels to discreet
>> handler threads to process, rather than polling on every thread.
>> This will eliminate the need to wake multiple threads.
>>
>> To check:
>>
>> ip netns add left
>> ip netns add right
>> ip link add center-left type veth peer name left0
>> ip link add center-right type veth peer name right0
>> ip link set left0 netns left
>> ip link set right0 netns right
>> ip link set center-left up
>> ip link set center-right up
>> ip netns exec left ip link set left0 up
>> ip netns exec left ip addr add 172.31.110.10/24 dev left0
>> ip netns exec right ip link set right0 up
>> ip netns exec right ip addr add 172.31.110.11/24 dev right0
>>
>> ovs-vsctl add-br br0
>> ovs-vsctl add-port br0 center-right
>> ovs-vsctl add-port br0 center-left
>>
>> # in one terminal
>> perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>>
>> # in a separate terminal
>> ip netns exec left arping -I left0 -c 1 172.31.110.11
>>
>> # in the perf terminal after exiting
>> perf script
>>
>> Look for the number of 'handler' threads which were made active.
>>
>> Reported-by: David Ahern <dsahern@gmail.com>
>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
>> Cc: Matteo Croce <technoboy85@gmail.com>
>> Cc: Flavio Leitner <fbl@sysclose.org>
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>
>
> Great work!
>
> Just to understand, this reverts some of the logic of my patch and
> transposes the threads and sockets?

Yes.

> Maybe it would be worth adding a Fixes tag pointing to 69c51582f?

Okay, I'll include it when I post with the RFC tag removed.

> Nitpick: use `ip -n` instead of `ip netns exec` where possible, e.g.
>
>     ip netns exec left ip link set left0 up
>
> becomes
>
>     ip -n left link set left0 up

Neat!

> Cheers,
On Wed, Jun 24, 2020 at 09:48:00AM -0400, Aaron Conole wrote:
> Currently, the channel handlers are polled globally. On some
> systems, this causes a thundering herd issue where multiple
> handler threads become active, only to do no work and immediately
> sleep.
>
> The approach here is to push the netlink socket channels to discreet
> handler threads to process, rather than polling on every thread.
> This will eliminate the need to wake multiple threads.
>
> To check:
>
> ip netns add left
> ip netns add right
> ip link add center-left type veth peer name left0
> ip link add center-right type veth peer name right0
> ip link set left0 netns left
> ip link set right0 netns right
> ip link set center-left up
> ip link set center-right up
> ip netns exec left ip link set left0 up
> ip netns exec left ip addr add 172.31.110.10/24 dev left0
> ip netns exec right ip link set right0 up
> ip netns exec right ip addr add 172.31.110.11/24 dev right0
>
> ovs-vsctl add-br br0
> ovs-vsctl add-port br0 center-right
> ovs-vsctl add-port br0 center-left
>
> # in one terminal
> perf record -e sched:sched_wakeup,irq:softirq_entry -ag
>
> # in a separate terminal
> ip netns exec left arping -I left0 -c 1 172.31.110.11
>
> # in the perf terminal after exiting
> perf script
>
> Look for the number of 'handler' threads which were made active.

I am surprised to find out that in 2020 Linux still doesn't provide a reasonable solution for the thundering herd issue. I wonder how many applications out there have this issue. It looks like the devil is in the details.

Anyways, I think the previous model didn't work because the number of attached ports is increasing and so is the number of cores. The number of sockets escalates quickly and it had the packet re-ordering issue.

Moving forward, we moved to one socket per port with all threads polling the same socket in the hope that the kernel would only wake one or very few threads.
Looks like that doesn't work and the packet re-ordering is still an issue.

Therefore this approach where there is one socket per port but only one thread polling would solve the thundering herd issue and the packet re-ordering issue.

One concern is with load capacity, since now only one thread would be responsible for processing a port, but I think that's what we had initially, correct? So, it's not an issue.

It looks like now it is allocating enough memory for all port_idx per thread even though it will not use all of them. Perhaps that can be fixed? For instance, instead of indexing the array with port_idx, use handler->port_idx as an identifier?

More inline

> Reported-by: David Ahern <dsahern@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> Cc: Matteo Croce <technoboy85@gmail.com>
> Cc: Flavio Leitner <fbl@sysclose.org>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
> I haven't finished all my testing, but I wanted to send this
> for early feedback in case I went completely off course.
>
> lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++-------------------
> 1 file changed, 112 insertions(+), 78 deletions(-)
>
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 1817e9f849..d59375cf00 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -179,9 +179,11 @@ struct dpif_windows_vport_sock {
>
> struct dpif_handler {
> struct epoll_event *epoll_events;
> - int epoll_fd; /* epoll fd that includes channel socks. */
> - int n_events; /* Num events returned by epoll_wait(). */
> - int event_offset; /* Offset into 'epoll_events'. */
> + struct dpif_channel *channels; /* Array of channels for each port. */
> + size_t n_channels; /* Count of channels for each port. */
> + int epoll_fd; /* epoll fd that includes channel socks. */
> + int n_events; /* Num events returned by epoll_wait(). */
> + int event_offset; /* Offset into 'epoll_events'. */
>
> #ifdef _WIN32
> /* Pool of sockets.
*/ > @@ -201,7 +203,6 @@ struct dpif_netlink { > struct fat_rwlock upcall_lock; > struct dpif_handler *handlers; > uint32_t n_handlers; /* Num of upcall handlers. */ > - struct dpif_channel *channels; /* Array of channels for each port. */ > int uc_array_size; /* Size of 'handler->channels' and */ > /* 'handler->epoll_events'. */ > > @@ -452,15 +453,22 @@ static bool > vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx, > uint32_t *upcall_pid) > { > - /* Since the nl_sock can only be assigned in either all > - * or none "dpif" channels, the following check > - * would suffice. */ > - if (!dpif->channels[port_idx].sock) { > - return false; > - } > + size_t i; > ovs_assert(!WINDOWS || dpif->n_handlers <= 1); > > - *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock); > + /* The 'port_idx' should only be valid for a single handler. */ > + for (i = 0; i < dpif->n_handlers; ++i) { > + > + if (port_idx < dpif->handlers[i].n_channels && > + dpif->handlers[i].channels[port_idx].sock) { > + *upcall_pid = > + nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); > + break; > + } > + } Perhaps /* The 'port_idx' should only be valid for a single handler. */ - for (i = 0; i < dpif->n_handlers; ++i) { + for (i = 0; i < dpif->n_handlers; i++) { + struct dpif_handler *handler = &dpif->handlers[i]; - if (port_idx < dpif->handlers[i].n_channels && - dpif->handlers[i].channels[port_idx].sock) { + if (port_idx < handler->n_channels && + handler->channels[port_idx].sock) { *upcall_pid = - nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); + nl_sock_pid(handler->channels[port_idx].sock); break; } } the same for similar places below. 
> + > + if (i == dpif->n_handlers) > + return false; > > return true; > } > @@ -473,15 +481,23 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no, > uint32_t port_idx = odp_to_u32(port_no); > size_t i; > int error; > + struct dpif_handler *handler = NULL; > > if (dpif->handlers == NULL) { > close_nl_sock(sock); > return 0; > } > > + /* choose a handler by finding the least populated handler. */ > + handler = &dpif->handlers[0]; > + for (i = 0; i < dpif->n_handlers; ++i) { > + if (dpif->handlers[i].n_channels < handler->n_channels) > + handler = &dpif->handlers[i]; > + } > + > /* We assume that the datapath densely chooses port numbers, which can > * therefore be used as an index into 'channels' and 'epoll_events' of > - * 'dpif'. */ > + * 'dpif->handler'. */ > if (port_idx >= dpif->uc_array_size) { > uint32_t new_size = port_idx + 1; > > @@ -491,40 +507,33 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no, > return EFBIG; > } > > - dpif->channels = xrealloc(dpif->channels, > - new_size * sizeof *dpif->channels); > + handler->channels = xrealloc(handler->channels, > + new_size * sizeof *handler->channels); > > - for (i = dpif->uc_array_size; i < new_size; i++) { > - dpif->channels[i].sock = NULL; > + for (i = handler->n_channels; i < new_size; i++) { > + handler->channels[i].sock = NULL; > } > > - for (i = 0; i < dpif->n_handlers; i++) { > - struct dpif_handler *handler = &dpif->handlers[i]; > - > - handler->epoll_events = xrealloc(handler->epoll_events, > - new_size * sizeof *handler->epoll_events); > - > - } > + handler->epoll_events = xrealloc(handler->epoll_events, > + new_size * sizeof *handler->epoll_events); > dpif->uc_array_size = new_size; > + handler->n_channels = new_size; > } > > memset(&event, 0, sizeof event); > event.events = EPOLLIN | EPOLLEXCLUSIVE; > event.data.u32 = port_idx; > > - for (i = 0; i < dpif->n_handlers; i++) { > - struct dpif_handler *handler = &dpif->handlers[i]; > - > #ifndef _WIN32 > - if 
(epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), > - &event) < 0) { > - error = errno; > - goto error; > - } > -#endif > + if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), > + &event) < 0) { > + error = errno; > + goto error; > } > - dpif->channels[port_idx].sock = sock; > - dpif->channels[port_idx].last_poll = LLONG_MIN; > +#endif > + > + handler->channels[port_idx].sock = sock; > + handler->channels[port_idx].last_poll = LLONG_MIN; > > return 0; > > @@ -535,34 +544,53 @@ error: > nl_sock_fd(sock), NULL); > } > #endif > - dpif->channels[port_idx].sock = NULL; > + handler->channels[port_idx].sock = NULL; > > return error; > } > > +static void > +vport_del_channel(struct dpif_handler *handler, uint32_t port_idx) > +{ > + if (port_idx < handler->n_channels && > + handler->channels[port_idx].sock) { > +#ifndef _WIN32 > + epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, > + nl_sock_fd(handler->channels[port_idx].sock), NULL); > +#endif > + handler->event_offset = handler->n_events = 0; > + > +#ifndef _WIN32 > + nl_sock_destroy(handler->channels[port_idx].sock); > +#endif > + handler->channels[port_idx].sock = NULL; > + } > +} > + > static void > vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no) > { > uint32_t port_idx = odp_to_u32(port_no); > + struct dpif_handler *handler = NULL; > size_t i; > > - if (!dpif->handlers || port_idx >= dpif->uc_array_size > - || !dpif->channels[port_idx].sock) { > + if (!dpif->handlers || port_idx >= dpif->uc_array_size) { > return; > } > > for (i = 0; i < dpif->n_handlers; i++) { > - struct dpif_handler *handler = &dpif->handlers[i]; > -#ifndef _WIN32 > - epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, > - nl_sock_fd(dpif->channels[port_idx].sock), NULL); > -#endif > - handler->event_offset = handler->n_events = 0; > + handler = &dpif->handlers[i]; > + > + if (port_idx >= handler->n_channels || > + !handler->channels[port_idx].sock) { > + handler = NULL; > + continue; > + } > + } > + > + if (handler) 
{ > + vport_del_channel(handler, port_idx); > } > -#ifndef _WIN32 > - nl_sock_destroy(dpif->channels[port_idx].sock); > -#endif > - dpif->channels[port_idx].sock = NULL; > } > > static void > @@ -575,36 +603,38 @@ destroy_all_channels(struct dpif_netlink *dpif) > return; > } > > - for (i = 0; i < dpif->uc_array_size; i++ ) { > - struct dpif_netlink_vport vport_request; > - uint32_t upcall_pids = 0; > - > - if (!dpif->channels[i].sock) { > - continue; > - } > + for (i = 0; i < dpif->n_handlers; i++) { > + struct dpif_handler *handler = &dpif->handlers[i]; > + size_t j; > > - /* Turn off upcalls. */ > - dpif_netlink_vport_init(&vport_request); > - vport_request.cmd = OVS_VPORT_CMD_SET; > - vport_request.dp_ifindex = dpif->dp_ifindex; > - vport_request.port_no = u32_to_odp(i); > - vport_request.n_upcall_pids = 1; > - vport_request.upcall_pids = &upcall_pids; > - dpif_netlink_vport_transact(&vport_request, NULL, NULL); > + for (j = 0; j < handler->n_channels; j++ ) { > + struct dpif_netlink_vport vport_request; > + uint32_t upcall_pids = 0; > > - vport_del_channels(dpif, u32_to_odp(i)); > - } > + if (!handler->channels[j].sock) { > + continue; > + } > > - for (i = 0; i < dpif->n_handlers; i++) { > - struct dpif_handler *handler = &dpif->handlers[i]; > + /* Turn off upcalls. 
*/ > + dpif_netlink_vport_init(&vport_request); > + vport_request.cmd = OVS_VPORT_CMD_SET; > + vport_request.dp_ifindex = dpif->dp_ifindex; > + vport_request.port_no = u32_to_odp(j); > + vport_request.n_upcall_pids = 1; > + vport_request.upcall_pids = &upcall_pids; > + dpif_netlink_vport_transact(&vport_request, NULL, NULL); > > + vport_del_channel(handler, j); > + } > + handler->n_channels = 0; > + free(handler->channels); > + handler->channels = NULL; > dpif_netlink_handler_uninit(handler); > free(handler->epoll_events); > + handler->epoll_events = NULL; > } > - free(dpif->channels); > free(dpif->handlers); > dpif->handlers = NULL; > - dpif->channels = NULL; > dpif->n_handlers = 0; > dpif->uc_array_size = 0; > } > @@ -1091,13 +1121,17 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif, > /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s > * channel, since it is not heavily loaded. */ > uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx; > + size_t i; > > /* Needs to check in case the socket pointer is changed in between > * the holding of upcall_lock. A known case happens when the main > * thread deletes the vport while the handler thread is handling > * the upcall from that port. 
*/ > - if (dpif->channels[idx].sock) { > - pid = nl_sock_pid(dpif->channels[idx].sock); > + for (i = 0; i < dpif->n_handlers; ++i) { > + if (idx < dpif->handlers[i].n_channels && > + dpif->handlers[i].channels[idx].sock) { > + pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock); > + } > } > } > > @@ -2738,7 +2772,7 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id, > > while (handler->event_offset < handler->n_events) { > int idx = handler->epoll_events[handler->event_offset].data.u32; > - struct dpif_channel *ch = &dpif->channels[idx]; > + struct dpif_channel *ch = &handler->channels[idx]; > > handler->event_offset++; > > @@ -2840,14 +2874,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif) > OVS_REQ_WRLOCK(dpif->upcall_lock) > { > if (dpif->handlers) { > - size_t i; > + size_t i, j; > > - if (!dpif->channels[0].sock) { > - return; > - } > - for (i = 0; i < dpif->uc_array_size; i++ ) { > - > - nl_sock_drain(dpif->channels[i].sock); > + for (j = 0; j < dpif->n_handlers; ++j) { > + for (i = 0; i < dpif->handlers[j].n_channels; i++ ) { > + if (dpif->handlers[j].channels[i].sock) { > + nl_sock_drain(dpif->handlers[j].channels[i].sock); > + } > + } > } > } > } > -- > 2.25.1 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Flavio Leitner <fbl@sysclose.org> writes: > On Wed, Jun 24, 2020 at 09:48:00AM -0400, Aaron Conole wrote: >> Currently, the channel handlers are polled globally. On some >> systems, this causes a thundering herd issue where multiple >> handler threads become active, only to do no work and immediately >> sleep. >> >> The approach here is to push the netlink socket channels to discreet >> handler threads to process, rather than polling on every thread. >> This will eliminate the need to wake multiple threads. >> >> To check: >> >> ip netns add left >> ip netns add right >> ip link add center-left type veth peer name left0 >> ip link add center-right type veth peer name right0 >> ip link set left0 netns left >> ip link set right0 netns right >> ip link set center-left up >> ip link set center-right up >> ip netns exec left ip link set left0 up >> ip netns exec left ip addr add 172.31.110.10/24 dev left0 >> ip netns exec right ip link set right0 up >> ip netns exec right ip addr add 172.31.110.11/24 dev right0 >> >> ovs-vsctl add-br br0 >> ovs-vsctl add-port br0 center-right >> ovs-vsctl add-port br0 center-left >> >> # in one terminal >> perf record -e sched:sched_wakeup,irq:softirq_entry -ag >> >> # in a separate terminal >> ip netns exec left arping -I left0 -c 1 172.31.110.11 >> >> # in the perf terminal after exiting >> perf script >> >> Look for the number of 'handler' threads which were made active. > > I am surprised to find out that in 2020 and Linux doesn't provide > a reasonable solution for the thundering herd issue. I wonder how > many applications have this issue out there. It looks like the > devil in is the details. It turns out that it's complicated. And the way OVS is structured, simply switching the poll loop to epoll() isn't as straightforward - epoll() would require a completely new design for the way FDs are polled. And then it wouldn't fit with other systems (like Windows or *BSD), and we'd need to implement things quite differently. 
It's very intrusive and painful to get right.

> Anyways, I think the previous model didn't work because the number
> of attached ports is increasing and so is the number of cores. The
> number of sockets escalates quickly and it had the packet re-ordering
> issue.

Yes - that's true. We would do n_ports * n_cores sockets. Now we just do n_ports sockets.

> Moving forward, we moved to one socket per port with all threads
> polling the same socket in the hope that the kernel would only wake
> one or very few threads. Looks like that doesn't work and the packet
> re-ordering is still an issue.

Yes - because what happens is the first-level epoll() may not trigger, then the threads go into the deep poll_wait(), which falls back to the poll() system call (which uses the older semantics and introduces the thundering herd - see the discussion linked). Then when a packet comes in to trigger it, the kernel will start waking all the threads.

> Therefore this approach where there is one socket per port but only
> one thread polling would solve the thundering herd issue and the
> packet re-ordering issue.
>
> One concern is with load capacity, since now only one thread would
> be responsible for processing a port, but I think that's what we had
> initially, correct? So, it's not an issue.

Yes. Actually, I think it might be better this way - if every packet for a port is getting slowpathed, we will keep all the port processing on the same core, so there is no more out-of-order delivery (which may have performance impacts on TCP), and since it's slowpath, performance won't really matter that much anyway.

> It looks like now it is allocating enough memory for all port_idx
> per thread even though it will not use all of them. Perhaps that
> can be fixed? For instance, instead of indexing the array with
> port_idx, use handler->port_idx as an identifier?

Yes, that's why I posted as RFC. I plan to clean up the API / structure a little bit this week after finishing going through my email from PTO.
> More inline > >> Reported-by: David Ahern <dsahern@gmail.com> >> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html >> Cc: Matteo Croce <technoboy85@gmail.com> >> Cc: Flavio Leitner <fbl@sysclose.org> >> Signed-off-by: Aaron Conole <aconole@redhat.com> >> --- >> I haven't finished all my testing, but I wanted to send this >> for early feedback in case I went completely off course. >> >> lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++------------------- >> 1 file changed, 112 insertions(+), 78 deletions(-) >> >> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c >> index 1817e9f849..d59375cf00 100644 >> --- a/lib/dpif-netlink.c >> +++ b/lib/dpif-netlink.c >> @@ -179,9 +179,11 @@ struct dpif_windows_vport_sock { >> >> struct dpif_handler { >> struct epoll_event *epoll_events; >> - int epoll_fd; /* epoll fd that includes channel socks. */ >> - int n_events; /* Num events returned by epoll_wait(). */ >> - int event_offset; /* Offset into 'epoll_events'. */ >> + struct dpif_channel *channels; /* Array of channels for each port. */ >> + size_t n_channels; /* Count of channels for each port. */ >> + int epoll_fd; /* epoll fd that includes channel socks. */ >> + int n_events; /* Num events returned by epoll_wait(). */ >> + int event_offset; /* Offset into 'epoll_events'. */ >> >> #ifdef _WIN32 >> /* Pool of sockets. */ >> @@ -201,7 +203,6 @@ struct dpif_netlink { >> struct fat_rwlock upcall_lock; >> struct dpif_handler *handlers; >> uint32_t n_handlers; /* Num of upcall handlers. */ >> - struct dpif_channel *channels; /* Array of channels for each port. */ >> int uc_array_size; /* Size of 'handler->channels' and */ >> /* 'handler->epoll_events'. */ >> >> @@ -452,15 +453,22 @@ static bool >> vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx, >> uint32_t *upcall_pid) >> { >> - /* Since the nl_sock can only be assigned in either all >> - * or none "dpif" channels, the following check >> - * would suffice. 
*/ >> - if (!dpif->channels[port_idx].sock) { >> - return false; >> - } >> + size_t i; >> ovs_assert(!WINDOWS || dpif->n_handlers <= 1); >> >> - *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock); >> + /* The 'port_idx' should only be valid for a single handler. */ >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + >> + if (port_idx < dpif->handlers[i].n_channels && >> + dpif->handlers[i].channels[port_idx].sock) { >> + *upcall_pid = >> + nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); >> + break; >> + } >> + } > > > Perhaps > > /* The 'port_idx' should only be valid for a single handler. */ > - for (i = 0; i < dpif->n_handlers; ++i) { > + for (i = 0; i < dpif->n_handlers; i++) { You don't like pre-increment operator? It's an old habit from writing lots of c++ :-) > + struct dpif_handler *handler = &dpif->handlers[i]; > > - if (port_idx < dpif->handlers[i].n_channels && > - dpif->handlers[i].channels[port_idx].sock) { > + if (port_idx < handler->n_channels && > + handler->channels[port_idx].sock) { > *upcall_pid = > - nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); > + nl_sock_pid(handler->channels[port_idx].sock); > break; > } > } > > the same for similar places below. I plan on replacing all the open-coded loops with a new FOR_EACH Thanks for the review, Flavio! > >> + >> + if (i == dpif->n_handlers) >> + return false; >> >> return true; >> } >> @@ -473,15 +481,23 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no, >> uint32_t port_idx = odp_to_u32(port_no); >> size_t i; >> int error; >> + struct dpif_handler *handler = NULL; >> >> if (dpif->handlers == NULL) { >> close_nl_sock(sock); >> return 0; >> } >> >> + /* choose a handler by finding the least populated handler. 
*/ >> + handler = &dpif->handlers[0]; >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + if (dpif->handlers[i].n_channels < handler->n_channels) >> + handler = &dpif->handlers[i]; >> + } >> + >> /* We assume that the datapath densely chooses port numbers, which can >> * therefore be used as an index into 'channels' and 'epoll_events' of >> - * 'dpif'. */ >> + * 'dpif->handler'. */ >> if (port_idx >= dpif->uc_array_size) { >> uint32_t new_size = port_idx + 1; >> >> @@ -491,40 +507,33 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no, >> return EFBIG; >> } >> >> - dpif->channels = xrealloc(dpif->channels, >> - new_size * sizeof *dpif->channels); >> + handler->channels = xrealloc(handler->channels, >> + new_size * sizeof *handler->channels); >> >> - for (i = dpif->uc_array_size; i < new_size; i++) { >> - dpif->channels[i].sock = NULL; >> + for (i = handler->n_channels; i < new_size; i++) { >> + handler->channels[i].sock = NULL; >> } >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> - >> - handler->epoll_events = xrealloc(handler->epoll_events, >> - new_size * sizeof *handler->epoll_events); >> - >> - } >> + handler->epoll_events = xrealloc(handler->epoll_events, >> + new_size * sizeof *handler->epoll_events); >> dpif->uc_array_size = new_size; >> + handler->n_channels = new_size; >> } >> >> memset(&event, 0, sizeof event); >> event.events = EPOLLIN | EPOLLEXCLUSIVE; >> event.data.u32 = port_idx; >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> - >> #ifndef _WIN32 >> - if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), >> - &event) < 0) { >> - error = errno; >> - goto error; >> - } >> -#endif >> + if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), >> + &event) < 0) { >> + error = errno; >> + goto error; >> } >> - dpif->channels[port_idx].sock = sock; >> - dpif->channels[port_idx].last_poll = LLONG_MIN; >> 
+#endif >> + >> + handler->channels[port_idx].sock = sock; >> + handler->channels[port_idx].last_poll = LLONG_MIN; >> >> return 0; >> >> @@ -535,34 +544,53 @@ error: >> nl_sock_fd(sock), NULL); >> } >> #endif >> - dpif->channels[port_idx].sock = NULL; >> + handler->channels[port_idx].sock = NULL; >> >> return error; >> } >> >> +static void >> +vport_del_channel(struct dpif_handler *handler, uint32_t port_idx) >> +{ >> + if (port_idx < handler->n_channels && >> + handler->channels[port_idx].sock) { >> +#ifndef _WIN32 >> + epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, >> + nl_sock_fd(handler->channels[port_idx].sock), NULL); >> +#endif >> + handler->event_offset = handler->n_events = 0; >> + >> +#ifndef _WIN32 >> + nl_sock_destroy(handler->channels[port_idx].sock); >> +#endif >> + handler->channels[port_idx].sock = NULL; >> + } >> +} >> + >> static void >> vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no) >> { >> uint32_t port_idx = odp_to_u32(port_no); >> + struct dpif_handler *handler = NULL; >> size_t i; >> >> - if (!dpif->handlers || port_idx >= dpif->uc_array_size >> - || !dpif->channels[port_idx].sock) { >> + if (!dpif->handlers || port_idx >= dpif->uc_array_size) { >> return; >> } >> >> for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> -#ifndef _WIN32 >> - epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, >> - nl_sock_fd(dpif->channels[port_idx].sock), NULL); >> -#endif >> - handler->event_offset = handler->n_events = 0; >> + handler = &dpif->handlers[i]; >> + >> + if (port_idx >= handler->n_channels || >> + !handler->channels[port_idx].sock) { >> + handler = NULL; >> + continue; >> + } >> + } >> + >> + if (handler) { >> + vport_del_channel(handler, port_idx); >> } >> -#ifndef _WIN32 >> - nl_sock_destroy(dpif->channels[port_idx].sock); >> -#endif >> - dpif->channels[port_idx].sock = NULL; >> } >> >> static void >> @@ -575,36 +603,38 @@ destroy_all_channels(struct dpif_netlink *dpif) >> return; >> } >> 
>> - for (i = 0; i < dpif->uc_array_size; i++ ) { >> - struct dpif_netlink_vport vport_request; >> - uint32_t upcall_pids = 0; >> - >> - if (!dpif->channels[i].sock) { >> - continue; >> - } >> + for (i = 0; i < dpif->n_handlers; i++) { >> + struct dpif_handler *handler = &dpif->handlers[i]; >> + size_t j; >> >> - /* Turn off upcalls. */ >> - dpif_netlink_vport_init(&vport_request); >> - vport_request.cmd = OVS_VPORT_CMD_SET; >> - vport_request.dp_ifindex = dpif->dp_ifindex; >> - vport_request.port_no = u32_to_odp(i); >> - vport_request.n_upcall_pids = 1; >> - vport_request.upcall_pids = &upcall_pids; >> - dpif_netlink_vport_transact(&vport_request, NULL, NULL); >> + for (j = 0; j < handler->n_channels; j++ ) { >> + struct dpif_netlink_vport vport_request; >> + uint32_t upcall_pids = 0; >> >> - vport_del_channels(dpif, u32_to_odp(i)); >> - } >> + if (!handler->channels[j].sock) { >> + continue; >> + } >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> + /* Turn off upcalls. */ >> + dpif_netlink_vport_init(&vport_request); >> + vport_request.cmd = OVS_VPORT_CMD_SET; >> + vport_request.dp_ifindex = dpif->dp_ifindex; >> + vport_request.port_no = u32_to_odp(j); >> + vport_request.n_upcall_pids = 1; >> + vport_request.upcall_pids = &upcall_pids; >> + dpif_netlink_vport_transact(&vport_request, NULL, NULL); >> >> + vport_del_channel(handler, j); >> + } >> + handler->n_channels = 0; >> + free(handler->channels); >> + handler->channels = NULL; >> dpif_netlink_handler_uninit(handler); >> free(handler->epoll_events); >> + handler->epoll_events = NULL; >> } >> - free(dpif->channels); >> free(dpif->handlers); >> dpif->handlers = NULL; >> - dpif->channels = NULL; >> dpif->n_handlers = 0; >> dpif->uc_array_size = 0; >> } >> @@ -1091,13 +1121,17 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif, >> /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s >> * channel, since it is not heavily loaded. 
*/ >> uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx; >> + size_t i; >> >> /* Needs to check in case the socket pointer is changed in between >> * the holding of upcall_lock. A known case happens when the main >> * thread deletes the vport while the handler thread is handling >> * the upcall from that port. */ >> - if (dpif->channels[idx].sock) { >> - pid = nl_sock_pid(dpif->channels[idx].sock); >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + if (idx < dpif->handlers[i].n_channels && >> + dpif->handlers[i].channels[idx].sock) { >> + pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock); >> + } >> } >> } >> >> @@ -2738,7 +2772,7 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id, >> >> while (handler->event_offset < handler->n_events) { >> int idx = handler->epoll_events[handler->event_offset].data.u32; >> - struct dpif_channel *ch = &dpif->channels[idx]; >> + struct dpif_channel *ch = &handler->channels[idx]; >> >> handler->event_offset++; >> >> @@ -2840,14 +2874,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif) >> OVS_REQ_WRLOCK(dpif->upcall_lock) >> { >> if (dpif->handlers) { >> - size_t i; >> + size_t i, j; >> >> - if (!dpif->channels[0].sock) { >> - return; >> - } >> - for (i = 0; i < dpif->uc_array_size; i++ ) { >> - >> - nl_sock_drain(dpif->channels[i].sock); >> + for (j = 0; j < dpif->n_handlers; ++j) { >> + for (i = 0; i < dpif->handlers[j].n_channels; i++ ) { >> + if (dpif->handlers[j].channels[i].sock) { >> + nl_sock_drain(dpif->handlers[j].channels[i].sock); >> + } >> + } >> } >> } >> } >> -- >> 2.25.1 >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 1817e9f849..d59375cf00 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -179,9 +179,11 @@ struct dpif_windows_vport_sock {
 
 struct dpif_handler {
     struct epoll_event *epoll_events;
-    int epoll_fd;                  /* epoll fd that includes channel socks. */
-    int n_events;                  /* Num events returned by epoll_wait(). */
-    int event_offset;              /* Offset into 'epoll_events'. */
+    struct dpif_channel *channels; /* Array of channels for each port. */
+    size_t n_channels;             /* Count of channels for each port. */
+    int epoll_fd;                  /* epoll fd that includes channel socks. */
+    int n_events;                  /* Num events returned by epoll_wait(). */
+    int event_offset;              /* Offset into 'epoll_events'. */
 
 #ifdef _WIN32
     /* Pool of sockets. */
@@ -201,7 +203,6 @@ struct dpif_netlink {
     struct fat_rwlock upcall_lock;
     struct dpif_handler *handlers;
     uint32_t n_handlers;           /* Num of upcall handlers. */
-    struct dpif_channel *channels; /* Array of channels for each port. */
     int uc_array_size;             /* Size of 'handler->channels' and */
                                    /* 'handler->epoll_events'. */
 
@@ -452,15 +453,22 @@ static bool
 vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx,
               uint32_t *upcall_pid)
 {
-    /* Since the nl_sock can only be assigned in either all
-     * or none "dpif" channels, the following check
-     * would suffice. */
-    if (!dpif->channels[port_idx].sock) {
-        return false;
-    }
+    size_t i;
 
     ovs_assert(!WINDOWS || dpif->n_handlers <= 1);
 
-    *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock);
+    /* The 'port_idx' should only be valid for a single handler. */
+    for (i = 0; i < dpif->n_handlers; ++i) {
+
+        if (port_idx < dpif->handlers[i].n_channels &&
+            dpif->handlers[i].channels[port_idx].sock) {
+            *upcall_pid =
+                nl_sock_pid(dpif->handlers[i].channels[port_idx].sock);
+            break;
+        }
+    }
+
+    if (i == dpif->n_handlers)
+        return false;
 
     return true;
 }
@@ -473,15 +481,23 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
     uint32_t port_idx = odp_to_u32(port_no);
     size_t i;
     int error;
+    struct dpif_handler *handler = NULL;
 
     if (dpif->handlers == NULL) {
         close_nl_sock(sock);
         return 0;
     }
 
+    /* choose a handler by finding the least populated handler. */
+    handler = &dpif->handlers[0];
+    for (i = 0; i < dpif->n_handlers; ++i) {
+        if (dpif->handlers[i].n_channels < handler->n_channels)
+            handler = &dpif->handlers[i];
+    }
+
     /* We assume that the datapath densely chooses port numbers, which can
      * therefore be used as an index into 'channels' and 'epoll_events' of
-     * 'dpif'. */
+     * 'dpif->handler'. */
     if (port_idx >= dpif->uc_array_size) {
         uint32_t new_size = port_idx + 1;
@@ -491,40 +507,33 @@ vport_add_channel(struct dpif_netlink *dpif, odp_port_t port_no,
             return EFBIG;
         }
 
-        dpif->channels = xrealloc(dpif->channels,
-                                  new_size * sizeof *dpif->channels);
+        handler->channels = xrealloc(handler->channels,
+                                     new_size * sizeof *handler->channels);
 
-        for (i = dpif->uc_array_size; i < new_size; i++) {
-            dpif->channels[i].sock = NULL;
+        for (i = handler->n_channels; i < new_size; i++) {
+            handler->channels[i].sock = NULL;
         }
 
-        for (i = 0; i < dpif->n_handlers; i++) {
-            struct dpif_handler *handler = &dpif->handlers[i];
-
-            handler->epoll_events = xrealloc(handler->epoll_events,
-                new_size * sizeof *handler->epoll_events);
-
-        }
+        handler->epoll_events = xrealloc(handler->epoll_events,
+            new_size * sizeof *handler->epoll_events);
 
         dpif->uc_array_size = new_size;
+        handler->n_channels = new_size;
     }
 
     memset(&event, 0, sizeof event);
     event.events = EPOLLIN | EPOLLEXCLUSIVE;
     event.data.u32 = port_idx;
 
-    for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-
 #ifndef _WIN32
-        if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
-                      &event) < 0) {
-            error = errno;
-            goto error;
-        }
-#endif
+    if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock),
+                  &event) < 0) {
+        error = errno;
+        goto error;
     }
-    dpif->channels[port_idx].sock = sock;
-    dpif->channels[port_idx].last_poll = LLONG_MIN;
+#endif
+
+    handler->channels[port_idx].sock = sock;
+    handler->channels[port_idx].last_poll = LLONG_MIN;
 
     return 0;
 
@@ -535,34 +544,53 @@ error:
                   nl_sock_fd(sock), NULL);
     }
 #endif
-    dpif->channels[port_idx].sock = NULL;
+    handler->channels[port_idx].sock = NULL;
 
     return error;
 }
 
+static void
+vport_del_channel(struct dpif_handler *handler, uint32_t port_idx)
+{
+    if (port_idx < handler->n_channels &&
+        handler->channels[port_idx].sock) {
+#ifndef _WIN32
+        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
+                  nl_sock_fd(handler->channels[port_idx].sock), NULL);
+#endif
+        handler->event_offset = handler->n_events = 0;
+
+#ifndef _WIN32
+        nl_sock_destroy(handler->channels[port_idx].sock);
+#endif
+        handler->channels[port_idx].sock = NULL;
+    }
+}
+
 static void
 vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no)
 {
     uint32_t port_idx = odp_to_u32(port_no);
+    struct dpif_handler *handler = NULL;
     size_t i;
 
-    if (!dpif->handlers || port_idx >= dpif->uc_array_size
-        || !dpif->channels[port_idx].sock) {
+    if (!dpif->handlers || port_idx >= dpif->uc_array_size) {
         return;
     }
 
     for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
-#ifndef _WIN32
-        epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL,
-                  nl_sock_fd(dpif->channels[port_idx].sock), NULL);
-#endif
-        handler->event_offset = handler->n_events = 0;
+        handler = &dpif->handlers[i];
+
+        if (port_idx >= handler->n_channels ||
+            !handler->channels[port_idx].sock) {
+            handler = NULL;
+            continue;
+        }
+    }
+
+    if (handler) {
+        vport_del_channel(handler, port_idx);
     }
-#ifndef _WIN32
-    nl_sock_destroy(dpif->channels[port_idx].sock);
-#endif
-    dpif->channels[port_idx].sock = NULL;
 }
 
 static void
@@ -575,36 +603,38 @@ destroy_all_channels(struct dpif_netlink *dpif)
         return;
     }
 
-    for (i = 0; i < dpif->uc_array_size; i++ ) {
-        struct dpif_netlink_vport vport_request;
-        uint32_t upcall_pids = 0;
-
-        if (!dpif->channels[i].sock) {
-            continue;
-        }
+    for (i = 0; i < dpif->n_handlers; i++) {
+        struct dpif_handler *handler = &dpif->handlers[i];
+        size_t j;
 
-        /* Turn off upcalls. */
-        dpif_netlink_vport_init(&vport_request);
-        vport_request.cmd = OVS_VPORT_CMD_SET;
-        vport_request.dp_ifindex = dpif->dp_ifindex;
-        vport_request.port_no = u32_to_odp(i);
-        vport_request.n_upcall_pids = 1;
-        vport_request.upcall_pids = &upcall_pids;
-        dpif_netlink_vport_transact(&vport_request, NULL, NULL);
+        for (j = 0; j < handler->n_channels; j++ ) {
+            struct dpif_netlink_vport vport_request;
+            uint32_t upcall_pids = 0;
 
-        vport_del_channels(dpif, u32_to_odp(i));
-    }
+            if (!handler->channels[j].sock) {
+                continue;
+            }
 
-    for (i = 0; i < dpif->n_handlers; i++) {
-        struct dpif_handler *handler = &dpif->handlers[i];
+            /* Turn off upcalls. */
+            dpif_netlink_vport_init(&vport_request);
+            vport_request.cmd = OVS_VPORT_CMD_SET;
+            vport_request.dp_ifindex = dpif->dp_ifindex;
+            vport_request.port_no = u32_to_odp(j);
+            vport_request.n_upcall_pids = 1;
+            vport_request.upcall_pids = &upcall_pids;
+            dpif_netlink_vport_transact(&vport_request, NULL, NULL);
 
+            vport_del_channel(handler, j);
+        }
+        handler->n_channels = 0;
+        free(handler->channels);
+        handler->channels = NULL;
         dpif_netlink_handler_uninit(handler);
         free(handler->epoll_events);
+        handler->epoll_events = NULL;
     }
-    free(dpif->channels);
     free(dpif->handlers);
     dpif->handlers = NULL;
-    dpif->channels = NULL;
     dpif->n_handlers = 0;
     dpif->uc_array_size = 0;
 }
@@ -1091,13 +1121,17 @@ dpif_netlink_port_get_pid__(const struct dpif_netlink *dpif,
         /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s
          * channel, since it is not heavily loaded. */
         uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx;
+        size_t i;
 
         /* Needs to check in case the socket pointer is changed in between
         * the holding of upcall_lock. A known case happens when the main
         * thread deletes the vport while the handler thread is handling
         * the upcall from that port. */
-        if (dpif->channels[idx].sock) {
-            pid = nl_sock_pid(dpif->channels[idx].sock);
+        for (i = 0; i < dpif->n_handlers; ++i) {
+            if (idx < dpif->handlers[i].n_channels &&
+                dpif->handlers[i].channels[idx].sock) {
+                pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock);
+            }
         }
     }
 
@@ -2738,7 +2772,7 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, uint32_t handler_id,
 
     while (handler->event_offset < handler->n_events) {
         int idx = handler->epoll_events[handler->event_offset].data.u32;
-        struct dpif_channel *ch = &dpif->channels[idx];
+        struct dpif_channel *ch = &handler->channels[idx];
 
         handler->event_offset++;
 
@@ -2840,14 +2874,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif)
     OVS_REQ_WRLOCK(dpif->upcall_lock)
 {
     if (dpif->handlers) {
-        size_t i;
+        size_t i, j;
 
-        if (!dpif->channels[0].sock) {
-            return;
-        }
-        for (i = 0; i < dpif->uc_array_size; i++ ) {
-
-            nl_sock_drain(dpif->channels[i].sock);
+        for (j = 0; j < dpif->n_handlers; ++j) {
+            for (i = 0; i < dpif->handlers[j].n_channels; i++ ) {
+                if (dpif->handlers[j].channels[i].sock) {
+                    nl_sock_drain(dpif->handlers[j].channels[i].sock);
+                }
+            }
         }
     }
 }
Currently, the channel handlers are polled globally.  On some
systems, this causes a thundering herd issue where multiple
handler threads become active, only to do no work and immediately
sleep.

The approach here is to push the netlink socket channels to discrete
handler threads to process, rather than polling on every thread.
This will eliminate the need to wake multiple threads.

To check:

  ip netns add left
  ip netns add right
  ip link add center-left type veth peer name left0
  ip link add center-right type veth peer name right0
  ip link set left0 netns left
  ip link set right0 netns right
  ip link set center-left up
  ip link set center-right up
  ip netns exec left ip link set left0 up
  ip netns exec left ip addr add 172.31.110.10/24 dev left0
  ip netns exec right ip link set right0 up
  ip netns exec right ip addr add 172.31.110.11/24 dev right0

  ovs-vsctl add-br br0
  ovs-vsctl add-port br0 center-right
  ovs-vsctl add-port br0 center-left

  # in one terminal
  perf record -e sched:sched_wakeup,irq:softirq_entry -ag

  # in a separate terminal
  ip netns exec left arping -I left0 -c 1 172.31.110.11

  # in the perf terminal after exiting
  perf script

Look for the number of 'handler' threads which were made active.

Reported-by: David Ahern <dsahern@gmail.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
Cc: Matteo Croce <technoboy85@gmail.com>
Cc: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
---
I haven't finished all my testing, but I wanted to send this for early
feedback in case I went completely off course.

 lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++------------------
 1 file changed, 112 insertions(+), 78 deletions(-)