[ovs-dev,v3] ovsdb-cs: Perform forced reconnects without a backoff.

Message ID: 20210721125100.12720-1-dceara@redhat.com
State: Accepted
Series: [ovs-dev,v3] ovsdb-cs: Perform forced reconnects without a backoff.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot success github build: passed

Commit Message

Dumitru Ceara July 21, 2021, 12:51 p.m. UTC
The ovsdb-cs layer triggers a forced reconnect in various cases:
- when an inconsistency is detected in the data received from the
  remote server.
- when the remote server is running in clustered mode and transitioned
  to "follower", if the client is configured in "leader-only" mode.
- when explicitly requested by upper layers (e.g., by the user
  application, through the IDL layer); see the sketch below.
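
For illustration, a minimal sketch of that last case driven through the
Python IDL; the remote address and schema file are placeholders, and
only the behavior of force_reconnect() changes with this patch:

    import ovs.db.idl

    helper = ovs.db.idl.SchemaHelper("idltest.ovsschema")
    helper.register_all()
    idl = ovs.db.idl.Idl("tcp:127.0.0.1:6640", helper)

    # ... later, e.g. once the application decides its view is stale:
    idl.force_reconnect()  # with this patch, retried without backoff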

In such cases it's desirable that reconnection should happen as fast as
possible, without the current exponential backoff maintained by the
underlying reconnect object.  Furthermore, since 3c2d6274bcee ("raft:
Transfer leadership before creating snapshots."), leadership changes
inside the clustered database happen more often and, therefore,
"leader-only" clients need to reconnect more often too.

Forced reconnects call jsonrpc_session_force_reconnect(), which will not
reset the backoff.  To make sure clients reconnect as fast as possible in
the aforementioned scenarios, we first call the new API,
jsonrpc_session_reset_backoff(), in ovsdb-cs, for sessions that are in
state CS_S_MONITORING (i.e., the remote is likely still alive and
functioning fine).

jsonrpc_session_reset_backoff() resets the number of backoff-free
reconnect retries to the number of remotes configured for the session,
ensuring that all remotes are retried exactly once with backoff 0.
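
As an illustration of this accounting, here is a minimal self-contained
sketch; "Session" and "Reconnect" are simplified stand-ins for the
jsonrpc session and reconnect objects, not the actual OVS classes:

    # Models only the backoff-free try accounting described above.
    class Reconnect:
        def __init__(self):
            self.backoff_free_tries = 0

        def set_backoff_free_tries(self, n):
            self.backoff_free_tries = n

    class Session:
        def __init__(self, remotes, connected):
            self.remotes = remotes
            self.connected = connected
            self.reconnect = Reconnect()

        def reset_backoff(self):
            # One backoff-free try per configured remote ...
            free_tries = len(self.remotes)
            if self.connected:
                # ... plus one, consumed when the currently connected
                # remote is dropped.
                free_tries += 1
            self.reconnect.set_backoff_free_tries(free_tries)

    s = Session(["tcp:10.0.0.1:6641", "tcp:10.0.0.2:6641"], connected=True)
    s.reset_backoff()
    assert s.reconnect.backoff_free_tries == 3  # 2 remotes + 1, connected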

This commit also updates the Python IDL and jsonrpc implementations.
The Python IDL wasn't tracking the IDL_S_MONITORING state explicitly;
it now does.  Tests were also added to make sure that forced IDL
reconnects happen without backoff.

Reported-at: https://bugzilla.redhat.com/1977264
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
v3:
- Addressed Ilya's comments:
  - rephrase commit log.
  - fix jsonrpc_session_reset_backoff() comment.
  - add extra backoff-free try only to connected jsonrpc sessions.
  - fix comment for OVSDB_CHECK_CLUSTER_IDL_PY().

v2:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210721105614.25498-1-dceara@redhat.com/
- Reworked the patch based on the discussion with Ilya:
  - add a new jsonrpc_session_reset_backoff() API to be used by the
    ovsdb-cs layer to avoid backoffs when reconnecting due to
    inconsistent data/remote becoming follower.
v1:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210629112035.17402-1-dceara@redhat.com/
---
 lib/jsonrpc.c         | 18 +++++++++++
 lib/jsonrpc.h         |  1 +
 lib/ovsdb-cs.c        | 10 ++++++
 python/ovs/db/idl.py  | 11 +++++++
 python/ovs/jsonrpc.py | 13 ++++++++
 tests/ovsdb-idl.at    | 74 ++++++++++++++++++++++++++++++++++++++++---
 6 files changed, 123 insertions(+), 4 deletions(-)

Comments

Ilya Maximets July 23, 2021, 4:39 p.m. UTC | #1
On 7/21/21 2:51 PM, Dumitru Ceara wrote:
> [...]


Thanks!  Applied to master.  I also backported this down to 2.15, as
it's likely the version that stable OVN releases are built with.

Best regards, Ilya Maximets.
Dumitru Ceara July 23, 2021, 5:04 p.m. UTC | #2
On 7/23/21 6:39 PM, Ilya Maximets wrote:
> On 7/21/21 2:51 PM, Dumitru Ceara wrote:
>> [...]
> 
> 
> Thanks!  Applied to master.  I also backported this down to 2.15, as
> it's likely the version that stable OVN releases are built with.
> 
> Best regards, Ilya Maximets.
> 

Thanks!

Patch

diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index 7e333ae25f..c8ce5362e1 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1280,6 +1280,24 @@  jsonrpc_session_force_reconnect(struct jsonrpc_session *s)
     reconnect_force_reconnect(s->reconnect, time_msec());
 }
 
+/* Resets the reconnect backoff for 's' by allowing as many free tries as the
+ * number of configured remotes.  This is to be used by upper layers before
+ * calling jsonrpc_session_force_reconnect() if backoff is undesirable.
+ */
+void
+jsonrpc_session_reset_backoff(struct jsonrpc_session *s)
+{
+    unsigned int free_tries = s->remotes.n;
+
+    if (jsonrpc_session_is_connected(s)) {
+        /* The extra free try will be consumed when the current remote
+         * is disconnected.
+         */
+        free_tries++;
+    }
+    reconnect_set_backoff_free_tries(s->reconnect, free_tries);
+}
+
 /* Sets 'max_backoff' as the maximum time, in milliseconds, to wait after a
  * connection attempt fails before attempting to connect again. */
 void
diff --git a/lib/jsonrpc.h b/lib/jsonrpc.h
index 034a30b716..2aa97d3fe6 100644
--- a/lib/jsonrpc.h
+++ b/lib/jsonrpc.h
@@ -137,6 +137,7 @@  void jsonrpc_session_get_reconnect_stats(const struct jsonrpc_session *,
 
 void jsonrpc_session_enable_reconnect(struct jsonrpc_session *);
 void jsonrpc_session_force_reconnect(struct jsonrpc_session *);
+void jsonrpc_session_reset_backoff(struct jsonrpc_session *);
 
 void jsonrpc_session_set_max_backoff(struct jsonrpc_session *,
                                      int max_backoff);
diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
index f13065c6c0..659d49dbf7 100644
--- a/lib/ovsdb-cs.c
+++ b/lib/ovsdb-cs.c
@@ -729,6 +729,16 @@  void
 ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
 {
     if (cs->session) {
+        if (cs->state == CS_S_MONITORING) {
+            /* The ovsdb-cs was in MONITORING state, so we either had a data
+             * inconsistency on this server, or it stopped being the cluster
+             * leader, or the user requested a re-connect.  Avoid backoff in
+             * these cases, as we need to re-connect as soon as possible.
+             * Connections that are not in MONITORING state should keep their
+             * backoff, to avoid a constant flood of re-connection attempts
+             * when there is no suitable database server. */
+            jsonrpc_session_reset_backoff(cs->session);
+        }
         jsonrpc_session_force_reconnect(cs->session);
     }
 }
diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
index f24511720f..ecae5e1432 100644
--- a/python/ovs/db/idl.py
+++ b/python/ovs/db/idl.py
@@ -102,6 +102,7 @@  class Idl(object):
     IDL_S_SERVER_MONITOR_REQUESTED = 2
     IDL_S_DATA_MONITOR_REQUESTED = 3
     IDL_S_DATA_MONITOR_COND_REQUESTED = 4
+    IDL_S_MONITORING = 5
 
     def __init__(self, remote, schema_helper, probe_interval=None,
                  leader_only=True):
@@ -295,6 +296,7 @@  class Idl(object):
                     else:
                         assert self.state == self.IDL_S_DATA_MONITOR_REQUESTED
                         self.__parse_update(msg.result, OVSDB_UPDATE)
+                    self.state = self.IDL_S_MONITORING
 
                 except error.Error as e:
                     vlog.err("%s: parse error in received schema: %s"
@@ -442,6 +444,15 @@  class Idl(object):
     def force_reconnect(self):
         """Forces the IDL to drop its connection to the database and reconnect.
         In the meantime, the contents of the IDL will not change."""
+        if self.state == self.IDL_S_MONITORING:
+            # The IDL was in MONITORING state, so we either had a data
+            # inconsistency on this server, or it stopped being the cluster
+            # leader, or the user requested a re-connect.  Avoid backoff in
+            # these cases, as we need to re-connect as soon as possible.
+            # Connections that are not in MONITORING state should keep their
+            # backoff, to avoid a constant flood of re-connection attempts
+            # when there is no suitable database server.
+            self._session.reset_backoff()
         self._session.force_reconnect()
 
     def session_name(self):
diff --git a/python/ovs/jsonrpc.py b/python/ovs/jsonrpc.py
index bf32f8c87c..d5127268aa 100644
--- a/python/ovs/jsonrpc.py
+++ b/python/ovs/jsonrpc.py
@@ -612,5 +612,18 @@  class Session(object):
     def force_reconnect(self):
         self.reconnect.force_reconnect(ovs.timeval.msec())
 
+    def reset_backoff(self):
+        """ Resets the reconnect backoff by allowing as many free tries as the
+        number of configured remotes.  This is to be used by upper layers
+        before calling force_reconnect() if backoff is undesirable."""
+        free_tries = len(self.remotes)
+
+        if self.is_connected():
+            # The extra free try will be consumed when the current remote
+            # is disconnected.
+            free_tries += 1
+
+        self.reconnect.set_backoff_free_tries(free_tries)
+
     def get_num_of_remotes(self):
         return len(self.remotes)
diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
index e32f9ec895..1386f13770 100644
--- a/tests/ovsdb-idl.at
+++ b/tests/ovsdb-idl.at
@@ -2227,11 +2227,29 @@  m4_define([OVSDB_CHECK_IDL_LEADER_ONLY_PY],
 OVSDB_CHECK_IDL_LEADER_ONLY_PY([Check Python IDL connects to leader], 3, ['remote'])
 OVSDB_CHECK_IDL_LEADER_ONLY_PY([Check Python IDL reconnects to leader], 3, ['remote' '+remotestop' 'remote'])
 
-# same as OVSDB_CHECK_IDL but uses C IDL implementation with tcp
-# with multiple remotes.
+# OVSDB_CHECK_CLUSTER_IDL_C(TITLE, N_SERVERS, [PRE-IDL-TXN], TRANSACTIONS,
+#                           OUTPUT, [KEYWORDS], [FILTER], [LOG_FILTER])
+#
+# Creates a clustered database with a schema derived from idltest.ovsidl, runs
+# each PRE-IDL-TXN (if any), starts N_SERVERS ovsdb-server instances in RAFT,
+# on that database, and runs "test-ovsdb idl" passing each of the TRANSACTIONS
+# along.
+#
+# Checks that the overall output is OUTPUT.  Before comparison, the
+# output is sorted (using "sort") and UUIDs in the output are replaced
+# by markers of the form <N> where N is a number.  The first unique
+# UUID is replaced by <0>, the next by <1>, and so on.  If a given
+# UUID appears more than once it is always replaced by the same
+# marker.  If FILTER is supplied then the output is also filtered
+# through the specified program.
+#
+# TITLE is provided to AT_SETUP and KEYWORDS to AT_KEYWORDS.
+#
+# If LOG_FILTER is provided, checks that the contents of LOG_FILTER
+# are not matched by grep in the test-ovsdb logs.
 m4_define([OVSDB_CHECK_CLUSTER_IDL_C],
   [AT_SETUP([$1 - C - tcp])
-   AT_KEYWORDS([ovsdb server idl positive tcp socket $5])
+   AT_KEYWORDS([ovsdb server idl tcp $6])
    m4_define([LPBK],[127.0.0.1])
    OVSDB_CLUSTER_START_IDLTEST([$2], ["ptcp:0:"LPBK])
    PARSE_LISTENING_PORT([s1.log], [TCP_PORT_1])
@@ -2242,11 +2260,36 @@  m4_define([OVSDB_CHECK_CLUSTER_IDL_C],
    m4_if([$3], [], [],
      [AT_CHECK([ovsdb-client transact $remotes $3], [0], [ignore], [ignore])])
    AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc -t10 idl tcp:LPBK:$TCP_PORT_1 $4],
-            [0], [stdout], [ignore])
+            [0], [stdout], [stderr])
+   AT_CHECK([sort stdout | uuidfilt]m4_if([$7],,, [[| $7]]),
+            [0], [$5])
+   m4_ifval([$8], [AT_CHECK([grep '$8' stderr], [1])], [], [])
+   AT_CLEANUP])
+
+# Same as OVSDB_CHECK_CLUSTER_IDL_C but uses the Python IDL implementation.
+m4_define([OVSDB_CHECK_CLUSTER_IDL_PY],
+  [AT_SETUP([$1 - Python3 - tcp])
+   AT_KEYWORDS([ovsdb server idl tcp $6])
+   m4_define([LPBK],[127.0.0.1])
+   OVSDB_CLUSTER_START_IDLTEST([$2], ["ptcp:0:"LPBK])
+   PARSE_LISTENING_PORT([s1.log], [TCP_PORT_1])
+   PARSE_LISTENING_PORT([s2.log], [TCP_PORT_2])
+   PARSE_LISTENING_PORT([s3.log], [TCP_PORT_3])
+   remotes=tcp:LPBK:$TCP_PORT_1,tcp:LPBK:$TCP_PORT_2,tcp:LPBK:$TCP_PORT_3
+
+   m4_if([$3], [], [],
+     [AT_CHECK([ovsdb-client transact $remotes $3], [0], [ignore], [ignore])])
+   AT_CHECK([$PYTHON3 $srcdir/test-ovsdb.py -t10 idl $srcdir/idltest.ovsschema tcp:LPBK:$TCP_PORT_1 $4],
+            [0], [stdout], [stderr])
    AT_CHECK([sort stdout | uuidfilt]m4_if([$7],,, [[| $7]]),
             [0], [$5])
+   m4_ifval([$8], [AT_CHECK([grep '$8' stderr], [1])], [], [])
    AT_CLEANUP])
 
+m4_define([OVSDB_CHECK_CLUSTER_IDL],
+  [OVSDB_CHECK_CLUSTER_IDL_C($@)
+   OVSDB_CHECK_CLUSTER_IDL_PY($@)])
+
 # Checks that monitor_cond_since works fine when disconnects happen
 # with cond_change requests in flight (i.e., IDL is properly updated).
 OVSDB_CHECK_CLUSTER_IDL_C([simple idl, monitor_cond_since, cluster disconnect],
@@ -2306,3 +2349,26 @@  OVSDB_CHECK_IDL_C([simple idl, initially empty, set remotes],
 009: empty
 010: done
 ]])
+
+dnl This test checks that forceful reconnects triggered by the IDL
+dnl happen immediately (they should not use backoff).
+OVSDB_CHECK_CLUSTER_IDL([simple idl, initially empty, force reconnect],
+  3,
+  [],
+  [['+reconnect' \
+    'reconnect' \
+    'reconnect' \
+    'reconnect']],
+  [[000: reconnect
+001: empty
+002: reconnect
+003: empty
+004: reconnect
+005: empty
+006: reconnect
+007: empty
+008: done
+]],
+[],
+[],
+reconnect.*waiting .* seconds before reconnect)
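
For reference, the LOG_FILTER argument used above amounts to asserting
that the jsonrpc layer never logged a backoff wait during the test.  A
stand-alone sketch of that check (the log file name is a placeholder,
not a file this patch creates):

    import re
    import sys

    # Fail if any forced reconnect went through the exponential backoff.
    pattern = re.compile(r"reconnect.*waiting .* seconds before reconnect")
    with open("test-ovsdb.log") as log:
        if any(pattern.search(line) for line in log):
            sys.exit("forced reconnect used backoff; expected none")
    print("no backoff observed on forced reconnects")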