diff mbox

[ovs-dev] windows: Crash when the handle communication device cannot be found

Message ID 20170217123501.9372-1-aserdean@cloudbasesolutions.com
State Changes Requested
Headers show

Commit Message

Alin Serdean Feb. 17, 2017, 12:35 p.m. UTC
When trying to uninstall/disable the OVS extension the driver will
fail to unload properly(require reboot)/hang until ovs-vswitchd is closed.

The root cause of this behavior is because the handles from ovs-vswitchd
to the kernel communication devices are still opened although the
actual device was removed from the kernel.

Trying to close the handles will also fail because they do not exist.

The remaining option is to cause a crash and rely on the service manager
to restart ovs-vswitchd.

Reported-at: https://github.com/openvswitch/ovs-issues/issues/27
Reported-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
---
 lib/netlink-socket.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Comments

Ben Pfaff March 8, 2017, 11:18 p.m. UTC | #1
On Fri, Feb 17, 2017 at 12:35:08PM +0000, Alin Serdean wrote:
> When trying to uninstall/disable the OVS extension the driver will
> fail to unload properly(require reboot)/hang until ovs-vswitchd is closed.
> 
> The root cause of this behavior is because the handles from ovs-vswitchd
> to the kernel communication devices are still opened although the
> actual device was removed from the kernel.
> 
> Trying to close the handles will also fail because they do not exist.
> 
> The remaining option is to cause a crash and rely on the service manager
> to restart ovs-vswitchd.
> 
> Reported-at: https://github.com/openvswitch/ovs-issues/issues/27
> Reported-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
> Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>

I suggest using ovs_abort() instead of ovs_fatal() in this case, because
of the intent of ovs_abort():

 * This function is preferred to ovs_fatal() in a situation where it would make
 * sense for a monitoring process to restart the daemon.

On Linux, using ovs_abort() actually causes the monitoring process to
automatically restart the daemon.  On Windows, I don't know whether it
has that effect, but it at least makes the intent clear in the source
code.

I'd still like to see a review from a Windows developer.
Alin Serdean March 9, 2017, 9:58 a.m. UTC | #2
> On Fri, Feb 17, 2017 at 12:35:08PM +0000, Alin Serdean wrote:
> > When trying to uninstall/disable the OVS extension the driver will
> > fail to unload properly(require reboot)/hang until ovs-vswitchd is closed.
> >
> > The root cause of this behavior is because the handles from
> > ovs-vswitchd to the kernel communication devices are still opened
> > although the actual device was removed from the kernel.
> >
> > Trying to close the handles will also fail because they do not exist.
> >
> > The remaining option is to cause a crash and rely on the service
> > manager to restart ovs-vswitchd.
> >
> > Reported-at: https://github.com/openvswitch/ovs-issues/issues/27
> > Reported-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
> > Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
> 
> I suggest using ovs_abort() instead of ovs_fatal() in this case, because of the
> intent of ovs_abort():
> 
>  * This function is preferred to ovs_fatal() in a situation where it would make
>  * sense for a monitoring process to restart the daemon.
> 
> On Linux, using ovs_abort() actually causes the monitoring process to
> automatically restart the daemon.  On Windows, I don't know whether it has
> that effect, but it at least makes the intent clear in the source code.
> 
> I'd still like to see a review from a Windows developer.
[Alin Serdean] Thanks for the review Ben. Both `ovs_fatal()` and `ovs_abort()` will have the same outcome(restarting the service).
Shashank Ram March 9, 2017, 6:15 p.m. UTC | #3

diff mbox

Patch

diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index e45914c..4f56c60 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -60,6 +60,20 @@  static void log_nlmsg(const char *function, int error,
 #ifdef _WIN32
 static int get_sock_pid_from_kernel(struct nl_sock *sock);
 static int set_sock_property(struct nl_sock *sock);
+
+/* In the case DeviceIoControl failed and GetLastError returns with
+ * ERROR_NOT_FOUND means we lost communication with the kernel device.
+ * CloseHandle will fail because the handle in 'theory' does not exist.
+ * The only remaining option is to crash and allow the service to be restarted
+ * via service manager.  This is the only way to close the handle from both
+ * userspace and kernel. */
+void
+lost_communication(DWORD last_err)
+{
+    if (last_err == ERROR_NOT_FOUND) {
+        ovs_fatal(0, "lost communication with the kernel device");
+    }
+}
 #endif
 
 /* Netlink sockets. */
@@ -278,6 +292,7 @@  get_sock_pid_from_kernel(struct nl_sock *sock)
     if (!DeviceIoControl(sock->handle, OVS_IOCTL_GET_PID,
                          NULL, 0, &pid, sizeof(pid),
                          &bytes, NULL)) {
+        lost_communication(GetLastError());
         retval = EINVAL;
     } else {
         if (bytes < sizeof(pid)) {
@@ -535,11 +550,12 @@  nl_sock_send__(struct nl_sock *sock, const struct ofpbuf *msg,
         if (!DeviceIoControl(sock->handle, OVS_IOCTL_WRITE,
                              msg->data, msg->size, NULL, 0,
                              &bytes, NULL)) {
+            lost_communication(GetLastError());
             retval = -1;
             /* XXX: Map to a more appropriate error based on GetLastError(). */
             errno = EINVAL;
             VLOG_DBG_RL(&rl, "fatal driver failure in write: %s",
-                ovs_lasterror_to_string());
+                        ovs_lasterror_to_string());
         } else {
             retval = msg->size;
         }
@@ -629,6 +645,7 @@  nl_sock_recv__(struct nl_sock *sock, struct ofpbuf *buf, bool wait)
         DWORD bytes;
         if (!DeviceIoControl(sock->handle, sock->read_ioctl,
                              NULL, 0, tail, sizeof tail, &bytes, NULL)) {
+            lost_communication(GetLastError());
             VLOG_DBG_RL(&rl, "fatal driver failure in transact: %s",
                         ovs_lasterror_to_string());
             retval = -1;
@@ -879,6 +896,7 @@  nl_sock_transact_multiple__(struct nl_sock *sock,
             }
         } else if (!ret) {
             /* XXX: Map to a more appropriate error. */
+            lost_communication(GetLastError());
             error = EINVAL;
             VLOG_DBG_RL(&rl, "fatal driver failure: %s",
                 ovs_lasterror_to_string());
@@ -1275,6 +1293,7 @@  pend_io_request(struct nl_sock *sock)
         error = GetLastError();
         /* Check if the I/O got pended */
         if (error != ERROR_IO_INCOMPLETE && error != ERROR_IO_PENDING) {
+            lost_communication(error);
             VLOG_ERR("nl_sock_wait failed - %s\n", ovs_format_message(error));
             retval = EINVAL;
         }