diff mbox series

[ovs-dev] TCP Stream: Use TCP keepalive by default

Message ID 20211025143632.1912999-1-msantana@redhat.com
State Changes Requested
Headers show
Series [ovs-dev] TCP Stream: Use TCP keepalive by default | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Michael Santana Oct. 25, 2021, 2:36 p.m. UTC
In the case that a client disables jsonrpc probes the client would fail
to detect if the connection to the server has dropped. To workaround
such case TCP keepalive is enabled.

Signed-off-by: Michael Santana <msantana@redhat.com>
---
 lib/socket-util.c | 13 +++++++++++++
 lib/socket-util.h |  1 +
 lib/stream-fd.c   |  1 +
 lib/stream-ssl.c  |  2 ++
 lib/stream-tcp.c  |  1 +
 5 files changed, 18 insertions(+)

Comments

Flavio Leitner Oct. 27, 2021, 1:25 p.m. UTC | #1
On Mon, Oct 25, 2021 at 10:36:32AM -0400, Michael Santana wrote:
> In the case that a client disables jsonrpc probes the client would fail
> to detect if the connection to the server has dropped. To workaround
> such case TCP keepalive is enabled.
> 
> Signed-off-by: Michael Santana <msantana@redhat.com>
> ---

Patch looks good and makes sense to me.
It builds fine on Windows as well.

Acked-by: Flavio Leitner <fbl@sysclose.org>

fbl
Ilya Maximets Nov. 16, 2021, 8:54 p.m. UTC | #2
On 10/25/21 16:36, Michael Santana wrote:
> In the case that a client disables jsonrpc probes the client would fail
> to detect if the connection to the server has dropped. To workaround
> such case TCP keepalive is enabled.
> 
> Signed-off-by: Michael Santana <msantana@redhat.com>
> ---

Hi, Michael.  Thanks for the patch.  But I'm not sure why we need this,
at least in current form.

Standard keepalive configuration on modern systems is set to something
around 2 hours most of the time.  So, the user might have 2 hours of
downtime and not even notice.

TCP keepalives might be useful for the case where user knows that
application may not reply for a long time, so they have to set the
inactivity probe to a higher value.   In this case, we could detect
connection failure with TCP keepalive using shorter time interval.

Having TCP keepalive configured to a very long interval (which is system
default), IMHO, doesn't make a lot of sense.

I would also argue that inactivity probes should never be disabled, but
set to a higher value instead, because TCP keepalive will not be able
to detect hanged application, e.g. deadlock.

Best regards, Ilya Maximets.
Flavio Leitner Nov. 23, 2021, 5:44 p.m. UTC | #3
On Tue, Nov 16, 2021 at 09:54:54PM +0100, Ilya Maximets wrote:
> On 10/25/21 16:36, Michael Santana wrote:
> > In the case that a client disables jsonrpc probes the client would fail
> > to detect if the connection to the server has dropped. To workaround
> > such case TCP keepalive is enabled.
> > 
> > Signed-off-by: Michael Santana <msantana@redhat.com>
> > ---
> 
> Hi, Michael.  Thanks for the patch.  But I'm not sure why we need this,
> at least in current form.
> 
> Standard keepalive configuration on modern systems is set to something
> around 2 hours most of the time.  So, the user might have 2 hours of
> downtime and not even notice.
> 
> TCP keepalives might be useful for the case where user knows that
> application may not reply for a long time, so they have to set the
> inactivity probe to a higher value.   In this case, we could detect
> connection failure with TCP keepalive using shorter time interval.
> 
> Having TCP keepalive configured to a very long interval (which is system
> default), IMHO, doesn't make a lot of sense.
> 
> I would also argue that inactivity probes should never be disabled, but
> set to a higher value instead, because TCP keepalive will not be able
> to detect hanged application, e.g. deadlock.

That's a good point. My concern is that either we will need
to stop allowing setting to 0 and that can break upgrades,
or we need to silently set a big interval instead, which is
not a good program behavior IMHO.

Also, it sounds like not allowing to disable actually prevents
the user to shoot himself in the foot, but I don't know for
sure if that has a valid use-case.

Even if we decide to improve the built-in inactivity probes,
it still seems like a good idea to enable TCP keepalive as
a second layer of fault detection.

BTW, the patch missed the tag below
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1988461
diff mbox series

Patch

diff --git a/lib/socket-util.c b/lib/socket-util.c
index 4f1ffecf5..dbf35f9bb 100644
--- a/lib/socket-util.c
+++ b/lib/socket-util.c
@@ -114,6 +114,19 @@  setsockopt_tcp_nodelay(int fd)
     }
 }
 
+void
+setsockopt_tcp_keepalive(int fd)
+{
+    int on = 1;
+    int retval;
+
+    retval = setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
+    if (retval) {
+        retval = sock_errno();
+        VLOG_ERR("setsockopt(SO_KEEPALIVE): %s", sock_strerror(retval));
+    }
+}
+
 /* Sets the DSCP value of socket 'fd' to 'dscp', which must be 63 or less.
  * 'family' must indicate the socket's address family (AF_INET or AF_INET6, to
  * do anything useful). */
diff --git a/lib/socket-util.h b/lib/socket-util.h
index 9ccb7d4cc..f6cf00a32 100644
--- a/lib/socket-util.h
+++ b/lib/socket-util.h
@@ -33,6 +33,7 @@  struct ds;
 int set_nonblocking(int fd);
 void xset_nonblocking(int fd);
 void setsockopt_tcp_nodelay(int fd);
+void setsockopt_tcp_keepalive(int fd);
 int set_dscp(int fd, int family, uint8_t dscp);
 
 bool addr_is_ipv6(const char *host_name);
diff --git a/lib/stream-fd.c b/lib/stream-fd.c
index 46ee7ae27..df609579c 100644
--- a/lib/stream-fd.c
+++ b/lib/stream-fd.c
@@ -93,6 +93,7 @@  fd_connect(struct stream *stream)
     int retval = check_connection_completion(s->fd);
     if (retval == 0 && s->fd_type == AF_INET) {
         setsockopt_tcp_nodelay(s->fd);
+        setsockopt_tcp_keepalive(s->fd);
     }
     return retval;
 }
diff --git a/lib/stream-ssl.c b/lib/stream-ssl.c
index 0ea3f2c08..0a47208da 100644
--- a/lib/stream-ssl.c
+++ b/lib/stream-ssl.c
@@ -267,6 +267,7 @@  new_ssl_stream(char *name, char *server_name, int fd, enum session_type type,
      */
     if (state == STATE_SSL_CONNECTING) {
         setsockopt_tcp_nodelay(fd);
+        setsockopt_tcp_keepalive(fd);
     }
 
     /* Create and configure OpenSSL stream. */
@@ -521,6 +522,7 @@  ssl_connect(struct stream *stream)
         }
         sslv->state = STATE_SSL_CONNECTING;
         setsockopt_tcp_nodelay(sslv->fd);
+        setsockopt_tcp_keepalive(sslv->fd);
         /* Fall through. */
 
     case STATE_SSL_CONNECTING:
diff --git a/lib/stream-tcp.c b/lib/stream-tcp.c
index e8dc2bfaa..48443b9be 100644
--- a/lib/stream-tcp.c
+++ b/lib/stream-tcp.c
@@ -43,6 +43,7 @@  new_tcp_stream(char *name, int fd, int connect_status, struct stream **streamp)
 {
     if (connect_status == 0) {
         setsockopt_tcp_nodelay(fd);
+        setsockopt_tcp_keepalive(fd);
     }
 
     return new_fd_stream(name, fd, connect_status, AF_INET, streamp);