
[ovs-dev] reconnect: Add graceful reconnect.

Message ID 20210629112035.17402-1-dceara@redhat.com
State Changes Requested
Series [ovs-dev] reconnect: Add graceful reconnect.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot success github build: passed

Commit Message

Dumitru Ceara June 29, 2021, 11:20 a.m. UTC
Until now clients that needed to reconnect immediately could only use
reconnect_force_reconnect().  However, reconnect_force_reconnect()
doesn't reset the backoff for connections that were alive long enough
(more than backoff seconds).

Moreover, the reconnect library cannot determine the exact reason why a
client wishes to initiate a reconnection.  In most cases reconnection
happens because of a fatal error when communicating with the remote,
e.g., in the ovsdb-cs layer, when invalid messages are received from
ovsdb-server.  In such cases it makes sense to not reset the backoff
because the remote seems to be unhealthy.

There are however cases when reconnection is needed for other reasons.
One such example is when ovsdb-clients require "leader-only" connections
to clustered ovsdb-server databases.  Whenever the client determines
that the remote is not a leader anymore, it decides to reconnect to a
new remote from its list, searching for the new leader.  Using
jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
not reset backoff even though the former leader is still likely in good
shape.

Since 3c2d6274bcee ("raft: Transfer leadership before creating
snapshots.") leadership changes inside the clustered database happen
more often and therefore "leader-only" clients need to reconnect more
often too.  Not resetting the backoff every time a leadership change
happens will cause all reconnections to happen with the maximum backoff
(8 seconds) resulting in significant latency.

This commit also updates the Python reconnect and IDL implementations
and adds tests for force-reconnect and graceful-reconnect.

Reported-at: https://bugzilla.redhat.com/1977264
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 lib/jsonrpc.c           |   7 +
 lib/jsonrpc.h           |   1 +
 lib/ovsdb-cs.c          |  32 ++--
 lib/reconnect.c         |  36 ++++-
 lib/reconnect.h         |   1 +
 python/ovs/db/idl.py    |   8 +-
 python/ovs/jsonrpc.py   |   3 +
 python/ovs/reconnect.py |  34 +++-
 tests/reconnect.at      | 344 ++++++++++++++++++++++++++++++++++++++++
 tests/test-reconnect.c  |   7 +
 tests/test-reconnect.py |   5 +
 11 files changed, 454 insertions(+), 24 deletions(-)

Comments

Ben Pfaff June 29, 2021, 4:02 p.m. UTC | #1
On Tue, Jun 29, 2021 at 01:20:35PM +0200, Dumitru Ceara wrote:
> Until now clients that needed to reconnect immediately could only use
> reconnect_force_reconnect().  However, reconnect_force_reconnect()
> doesn't reset the backoff for connections that were alive long enough
> (more than backoff seconds).
> 
> Moreover, the reconnect library cannot determine the exact reason why a
> client wishes to initiate a reconnection.  In most cases reconnection
> happens because of a fatal error when communicating with the remote,
> e.g., in the ovsdb-cs layer, when invalid messages are received from
> ovsdb-server.  In such cases it makes sense to not reset the backoff
> because the remote seems to be unhealthy.
> 
> There are however cases when reconnection is needed for other reasons.
> One such example is when ovsdb-clients require "leader-only" connections
> to clustered ovsdb-server databases.  Whenever the client determines
> that the remote is not a leader anymore, it decides to reconnect to a
> new remote from its list, searching for the new leader.  Using
> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
> not reset backoff even though the former leader is still likely in good
> shape.
> 
> Since 3c2d6274bcee ("raft: Transfer leadership before creating
> snapshots.") leadership changes inside the clustered database happen
> more often and therefore "leader-only" clients need to reconnect more
> often too.  Not resetting the backoff every time a leadership change
> happens will cause all reconnections to happen with the maximum backoff
> (8 seconds) resulting in significant latency.
> 
> This commit also updates the Python reconnect and IDL implementations
> and adds tests for force-reconnect and graceful-reconnect.
> 
> Reported-at: https://bugzilla.redhat.com/1977264
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>

I only glanced over this, but my reaction is good.  Thank you for
adding tests and writing such a thorough rationale!
Ilya Maximets July 12, 2021, 8:36 p.m. UTC | #2
On 6/29/21 1:20 PM, Dumitru Ceara wrote:
> Until now clients that needed to reconnect immediately could only use
> reconnect_force_reconnect().  However, reconnect_force_reconnect()
> doesn't reset the backoff for connections that were alive long enough
> (more than backoff seconds).
> 
> Moreover, the reconnect library cannot determine the exact reason why a
> client wishes to initiate a reconnection.  In most cases reconnection
> happens because of a fatal error when communicating with the remote,
> e.g., in the ovsdb-cs layer, when invalid messages are received from
> ovsdb-server.  In such cases it makes sense to not reset the backoff
> because the remote seems to be unhealthy.
> 
> There are however cases when reconnection is needed for other reasons.
> One such example is when ovsdb-clients require "leader-only" connections
> to clustered ovsdb-server databases.  Whenever the client determines
> that the remote is not a leader anymore, it decides to reconnect to a
> new remote from its list, searching for the new leader.  Using
> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
> not reset backoff even though the former leader is still likely in good
> shape.
> 
> Since 3c2d6274bcee ("raft: Transfer leadership before creating
> snapshots.") leadership changes inside the clustered database happen
> more often and therefore "leader-only" clients need to reconnect more
> often too.  Not resetting the backoff every time a leadership change
> happens will cause all reconnections to happen with the maximum backoff
> (8 seconds) resulting in significant latency.
> 
> This commit also updates the Python reconnect and IDL implementations
> and adds tests for force-reconnect and graceful-reconnect.
> 
> Reported-at: https://bugzilla.redhat.com/1977264
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---

Hi, Dumitru.

Thanks for working on this issue.  I've seen it in practice while running
OVN tests, but I still don't quite understand why it happens.  Could you,
please, describe how the state transitions work here for the ovsdb-idl case?

> +# Forcefully reconnect.
> +force-reconnect
> +  in RECONNECT for 0 ms (2000 ms backoff)
> +  1 successful connections out of 3 attempts, seqno 2
> +  disconnected
> +run
> +  should disconnect
> +connecting
> +  in CONNECTING for 0 ms (2000 ms backoff)

Especially this part seems wrong to me.  Because after 'should disconnect'
there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
literally should disconnect here, otherwise it's a violation of the reconnect
API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
by not calling reconnect_disconnected() when it is required, or there is some
other bug that makes the 'reconnect' module jump over a few states in the FSM.

The logical workflow for the force-reconnect, from what I see in the code
should be:

1. force-reconnect --> transition to S_RECONNECT
2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
3. disconnect -> check the state, update backoff and transition to S_BACKOFF
4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
5. connected ....

Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
maybe also #4.
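
For reference, this is how a client is expected to drive that sequence
through the reconnect API (a minimal sketch against lib/reconnect.h; the
actual stream handling is stubbed out):

#include "reconnect.h"
#include "timeval.h"

/* Drives one forced reconnection through the FSM, mirroring steps 1-5
 * above.  Only the FSM calls matter here. */
static void
force_reconnect_flow(struct reconnect *fsm)
{
    /* 1. Request the reconnect: the FSM moves to S_RECONNECT. */
    reconnect_force_reconnect(fsm, time_msec());

    /* 2. In S_RECONNECT, reconnect_run() asks us to disconnect. */
    if (reconnect_run(fsm, time_msec()) == RECONNECT_DISCONNECT) {
        /* ... close the current stream here ... */
        /* 3. Report it: the FSM updates the backoff and moves to S_BACKOFF. */
        reconnect_disconnected(fsm, time_msec(), 0);
    }

    /* 4. Once the backoff expires, reconnect_run() asks us to connect. */
    if (reconnect_run(fsm, time_msec()) == RECONNECT_CONNECT) {
        reconnect_connecting(fsm, time_msec());
        /* 5. ... open a new stream, then call reconnect_connected() ... */
    }
}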

Best regards, Ilya Maximets.
Dumitru Ceara July 13, 2021, 3:03 p.m. UTC | #3
On 7/12/21 10:36 PM, Ilya Maximets wrote:
> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>> Until now clients that needed to reconnect immediately could only use
>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>> doesn't reset the backoff for connections that were alive long enough
>> (more than backoff seconds).
>>
>> Moreover, the reconnect library cannot determine the exact reason why a
>> client wishes to initiate a reconnection.  In most cases reconnection
>> happens because of a fatal error when communicating with the remote,
>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>> because the remote seems to be unhealthy.
>>
>> There are however cases when reconnection is needed for other reasons.
>> One such example is when ovsdb-clients require "leader-only" connections
>> to clustered ovsdb-server databases.  Whenever the client determines
>> that the remote is not a leader anymore, it decides to reconnect to a
>> new remote from its list, searching for the new leader.  Using
>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>> not reset backoff even though the former leader is still likely in good
>> shape.
>>
>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>> snapshots.") leadership changes inside the clustered database happen
>> more often and therefore "leader-only" clients need to reconnect more
>> often too.  Not resetting the backoff every time a leadership change
>> happens will cause all reconnections to happen with the maximum backoff
>> (8 seconds) resulting in significant latency.
>>
>> This commit also updates the Python reconnect and IDL implementations
>> and adds tests for force-reconnect and graceful-reconnect.
>>
>> Reported-at: https://bugzilla.redhat.com/1977264
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> ---
> 
> Hi, Dumitru.
> 

Hi Ilya,

> Thanks for working on this issue.  I've seen it in practice while running
> OVN tests, but I still don't quite understand why it happens.  Could you,
> please, describe how the state transitions work here for the ovsdb-idl case?
> 

Without the patch, assuming a current backoff of X seconds, the sequence
of events (even for a connection that has seen activity after backoff)
is:

- reconnect_force_reconnect() -> move to S_RECONNECT
- ovsdb_cs_run()
  -> jsonrpc_session_run()
     -> reconnect_run()
     -> reconnect_disconnected()
        -> because state is not S_ACTIVE | S_IDLE backoff is not
           changed, stays X.
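
For reference, this is the branch in reconnect_disconnected() that the
chain above ends up in (current code, also visible in the diff context
further down).  With the state being S_RECONNECT, neither the free-tries
branch nor the reset branch applies:

    if (fsm->backoff_free_tries > 1) {
        fsm->backoff_free_tries--;
        fsm->backoff = 0;
    } else if (fsm->state & (S_ACTIVE | S_IDLE)
               && (fsm->last_activity - fsm->last_connected >= fsm->backoff
                   || fsm->passive)) {
        fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
    } else {
        /* Forced reconnects arrive here in S_RECONNECT, so the existing
         * backoff is kept rather than reset. */
    }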

>> +# Forcefully reconnect.
>> +force-reconnect
>> +  in RECONNECT for 0 ms (2000 ms backoff)
>> +  1 successful connections out of 3 attempts, seqno 2
>> +  disconnected
>> +run
>> +  should disconnect
>> +connecting
>> +  in CONNECTING for 0 ms (2000 ms backoff)
> 
> Especially this part seems wrong to me.  Because after 'should disconnect'
> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
> literally should disconnect here, otherwise it's a violation of the reconnect
> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
> by not calling reconnect_disconnected() when it is required, or there is some
> other bug that makes the 'reconnect' module jump over a few states in the FSM.
> 
> The logical workflow for the force-reconnect, from what I see in the code
> should be:
> 
> 1. force-reconnect --> transition to S_RECONNECT
> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
> 5. connected ....

This is just a bug in the test case I added; I need to issue
"disconnect" in the test case to trigger reconnect_disconnected()
to be called (like ovsdb-cs does).

> 
> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
> maybe also #4.

As mentioned above, it's not the case for ovsdb-cs.

However, while checking the test case I realized that the patch will
cause all "graceful" reconnects to happen with an initial backoff of 2
seconds for long lived sessions (instead of 1s).

That's because:

reconnect_graceful_reconnect()
-> reconnect_reset_backoff__()
   -> session was active after the initial 'backoff' seconds, so
      reset backoff to minimum.
   -> reconnect_transition__(..., S_RECONNECT);
ovsdb_cs_run()
-> jsonrpc_session_run()
   -> reconnect_run()
   -> reconnect_disconnected():
      -> if (!reconnect_reset_backoff__(..)) then double backoff.

I see a couple of potential ways to fix that:
1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
allow a single backoff free reconnect.
2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
was active recently (instead of calling reconnect_reset_backoff__()).
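
A rough sketch of what option 1 could look like inside lib/reconnect.c
(the body below is illustrative only, not the code from the posted patch):

void
reconnect_graceful_reconnect(struct reconnect *fsm, long long int now)
{
    if (fsm->state & (S_ACTIVE | S_IDLE)
        && fsm->last_activity - fsm->last_connected >= fsm->backoff
        && fsm->backoff_free_tries < 2) {
        /* reconnect_disconnected() only skips the backoff while free tries
         * are > 1, so two tries amount to one backoff-free reconnection. */
        fsm->backoff_free_tries = 2;
    }
    reconnect_transition__(fsm, now, S_RECONNECT);
}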

However, this is all under the assumption that we want to add support
for two types of reconnect "semantics":

a. forced, e.g., when the client detects inconsistencies in the data
received from the server (the server is in "bad shape") in which we
should never reset the backoff.

b. graceful, e.g., when the client reconnects because it needs a
leader-only connection and a leadership transfer happened on the
server side (the server is likely in "good shape" just not a leader
anymore).

An alternative to all this is to change reconnect_disconnected() to
reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
(used to be just active and idle).

I guess we could treat this as a bug fix and continue discussion for a
separate follow up patch to add the "graceful vs force" semantics if
they turn out to make sense for the future.

In essence (excluding potential python and test changes) the patch would
become:

diff --git a/lib/reconnect.c b/lib/reconnect.c
index a929ddfd2d..3a6d93f9d1 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
         if (fsm->backoff_free_tries > 1) {
             fsm->backoff_free_tries--;
             fsm->backoff = 0;
-        } else if (fsm->state & (S_ACTIVE | S_IDLE)
+        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
                    && (fsm->last_activity - fsm->last_connected >= fsm->backoff
                        || fsm->passive)) {
             fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
---

What do you think?

> 
> Best regards, Ilya Maximets.
> 

Regards,
Dumitru
Ilya Maximets July 20, 2021, 1:18 p.m. UTC | #4
On 7/13/21 5:03 PM, Dumitru Ceara wrote:
> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>> Until now clients that needed to reconnect immediately could only use
>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>> doesn't reset the backoff for connections that were alive long enough
>>> (more than backoff seconds).
>>>
>>> Moreover, the reconnect library cannot determine the exact reason why a
>>> client wishes to initiate a reconnection.  In most cases reconnection
>>> happens because of a fatal error when communicating with the remote,
>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>> because the remote seems to be unhealthy.
>>>
>>> There are however cases when reconnection is needed for other reasons.
>>> One such example is when ovsdb-clients require "leader-only" connections
>>> to clustered ovsdb-server databases.  Whenever the client determines
>>> that the remote is not a leader anymore, it decides to reconnect to a
>>> new remote from its list, searching for the new leader.  Using
>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>> not reset backoff even though the former leader is still likely in good
>>> shape.
>>>
>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>> snapshots.") leadership changes inside the clustered database happen
>>> more often and therefore "leader-only" clients need to reconnect more
>>> often too.  Not resetting the backoff every time a leadership change
>>> happens will cause all reconnections to happen with the maximum backoff
>>> (8 seconds) resulting in significant latency.
>>>
>>> This commit also updates the Python reconnect and IDL implementations
>>> and adds tests for force-reconnect and graceful-reconnect.
>>>
>>> Reported-at: https://bugzilla.redhat.com/1977264
>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>> ---
>>
>> Hi, Dumitru.
>>
> 
> Hi Ilya,
> 
>> Thanks for working on this issue.  I've seen it in practice while running
>> OVN tests, but I still don't quite understand why it happens.  Could you,
>> please, describe how the state transitions work here for the ovsdb-idl case?
>>
> 
> Without the patch, assuming a current backoff of X seconds, the sequence
> of events (even for a connection that has seen activity after backoff)
> is:
> 
> - reconnect_force_reconnect() -> move to S_RECONNECT
> - ovsdb_cs_run()
>   -> jsonrpc_session_run()
>      -> reconnect_run()
>      -> reconnect_disconnected()
>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>            changed, stays X.

Hmm, I see.  Thanks for explanation!

> 
>>> +# Forcefully reconnect.
>>> +force-reconnect
>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>> +  1 successful connections out of 3 attempts, seqno 2
>>> +  disconnected
>>> +run
>>> +  should disconnect
>>> +connecting
>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>
>> Especially this part seems wrong to me.  Because after 'should disconnect'
>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>> literally should disconnect here, otherwise it's a violation of the reconnect
>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>> by not calling reconnect_disconnected() when it is required, or there is some
>> other bug that makes the 'reconnect' module jump over a few states in the FSM.
>>
>> The logical workflow for the force-reconnect, from what I see in the code
>> should be:
>>
>> 1. force-reconnect --> transition to S_RECONNECT
>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>> 5. connected ....
> 
> This is just a bug in the test case I added, I need to issue
> "disconnect" in the test case to trigger reconnect_disconnected()
> to be called (like ovsdb-cs does).
> 
>>
>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>> maybe also #4.
> 
> As mentioned above, it's not the case for ovsdb-cs.
> 
> However, while checking the test case I realized that the patch will
> cause all "graceful" reconnects to happen with an initial backoff of 2
> seconds for long lived sessions (instead of 1s).
> 
> That's because:
> 
> reconnect_graceful_reconnect()
> -> reconnect_reset_backoff__()
>    -> session was active after the initial 'backoff' seconds, so
>       reset backoff to minimum.
>    -> reconnect_transition__(..., S_RECONNECT);
> ovsdb_cs_run()
> -> jsonrpc_session_run()
>    -> reconnect_run()
>    -> reconnect_disconnected():
>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
> 
> I see a couple of potential ways to fix that:
> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
> allow a single backoff free reconnect.
> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
> was active recently (instead of calling reconnect_reset_backoff__()).
> 
> However, this is all under the assumption that we want to add support
> for two types of reconnect "semantics":
> 
> a. forced, e.g., when the client detects inconsistencies in the data
> received from the server (the server is in "bad shape") in which we
> should never reset the backoff.
> 
> b. graceful, e.g., when the client reconnects because it needs a
> leader-only connection and a leadership transfer happened on the
> server side (the server is likely in "good shape" just not a leader
> anymore).
> 
> An alternative to all this is to change reconnect_disconnected() to
> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
> (used to be just active and idle).
> 
> I guess we could treat this as a bug fix and continue discussion for a
> separate follow up patch to add the "graceful vs force" semantics if
> they turn out to make sense for the future.
> 
> In essence (excluding potential python and test changes) the patch would
> become:
> 
> diff --git a/lib/reconnect.c b/lib/reconnect.c
> index a929ddfd2d..3a6d93f9d1 100644
> --- a/lib/reconnect.c
> +++ b/lib/reconnect.c
> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>          if (fsm->backoff_free_tries > 1) {
>              fsm->backoff_free_tries--;
>              fsm->backoff = 0;
> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>                         || fsm->passive)) {
>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
> ---
> 
> What do you think?

The problem I see with adding S_RECONNECT to this condition is that
'last_activity' and 'last_connected' don't belong to the current
connection in case we're forcefully interrupting a connection in the
S_CONNECTING state.  So, we will reset the backoff based on outdated
information.  Also, I think, in this case, 'last_connected' might
be larger than 'last_activity' and we will have a negative value here.
All values are signed, so it should not be an issue, but it's not
a clean solution.

Bumping the number of free tries seems like a better solution to me,
because:

1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
   is intended to avoid overloading the already overloaded server with
   connection attempts.  In case of forceful re-connection the server
   is not overloaded, so we should not increase a backoff.
   IMO, backoff should only be involved in cases where we have
   problems on the server side related to load, not the leader-only
   stuff or even database inconsistency problems.

2. force-reconnect should not consume free tries for exactly same
   reasons as in point 1.  Free tries are to avoid backoff, but
   backoff exists only to avoid storming the remote server with
   connections while it's overloaded.
   No overload -> No need for backoff -> No need to consume free tries.

3. With bumping of free tries we're avoiding logical issues around
   using 'last_activity' and 'last_connected' from the old connection.

4. Faster reconnection, since backoff will be zero.  And that is
   fine, because as far as we know, the server is not overloaded,
   it's just not suitable for our needs for some reason.
   If it is overloaded, we will backoff after the first failed
   connection attempt.

One potential problem I see with this solution is that if there
are no servers on a list that are suitable for ovsdb-cs, we will
have a log full of annoying reconnect logs.  Backoff kind of
rate-limits ovsdb-cs right now and we have no internal rate-limit
inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
not the issue of reconnect module itself.  But we need to have a
solution for this.

And actually, the problem here is that from the jsonrpc point of
view the connection is perfectly fine and functional, but ovsdb-cs
implements high-level logic on top of it and decides that
the connection is not good at a much later stage based on the actual
received data.  So, we need to somehow propagate this information
from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
the state machine, otherwise we will need two separate backoffs:
one inside the reconnect for usual connection problems and one
in ovsdb-cs for high-level data inconsistencies and leader changes.

Thinking this way led me to a different solution.  We could expose
something like jsonrpc_session_set_backoff_free_tries() and allow
ovsdb-cs to make a decision.  E.g.:

diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
index f13065c6c..900597b96 100644
--- a/lib/ovsdb-cs.c
+++ b/lib/ovsdb-cs.c
@@ -729,6 +729,17 @@ void
 ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
 {
     if (cs->session) {
+        if (cs->state == CS_S_MONITORING) {
+            /* The ovsdb-cs was in MONITORING state, so we either had data
+             * inconsistency on this server, or it stopped being the cluster
+             * leader, or the user requested to re-connect.  Avoiding backoff
+             * in these cases, as we need to re-connect as soon as possible.
+             * Connections that are not in MONITORING state should have their
+             * backoff to avoid constant flood of re-connection attempts in
+             * case there is no suitable database server. */
+            jsonrpc_session_set_backoff_free_tries(
+                cs->session, jsonrpc_session_get_n_remotes(cs->session));
+        }
         jsonrpc_session_force_reconnect(cs->session);
     }
 }
---

This way, if the server loses leadership or an inconsistency is detected,
the client will have 3 free attempts to find a new suitable server.
After that it will start to backoff as it does now.  No changes in
reconnect module required.
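
The jsonrpc helper assumed above does not exist yet; a minimal sketch of
it, simply forwarding to the existing reconnect_set_backoff_free_tries()
from lib/reconnect.h, could be:

void
jsonrpc_session_set_backoff_free_tries(struct jsonrpc_session *s,
                                       unsigned int tries)
{
    reconnect_set_backoff_free_tries(s->reconnect, tries);
}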

Thoughts?

Best regards, Ilya Maximets.
Dumitru Ceara July 20, 2021, 1:39 p.m. UTC | #5
On 7/20/21 3:18 PM, Ilya Maximets wrote:
> On 7/13/21 5:03 PM, Dumitru Ceara wrote:
>> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>>> Until now clients that needed to reconnect immediately could only use
>>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>>> doesn't reset the backoff for connections that were alive long enough
>>>> (more than backoff seconds).
>>>>
>>>> Moreover, the reconnect library cannot determine the exact reason why a
>>>> client wishes to initiate a reconnection.  In most cases reconnection
>>>> happens because of a fatal error when communicating with the remote,
>>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>>> because the remote seems to be unhealthy.
>>>>
>>>> There are however cases when reconnection is needed for other reasons.
>>>> One such example is when ovsdb-clients require "leader-only" connections
>>>> to clustered ovsdb-server databases.  Whenever the client determines
>>>> that the remote is not a leader anymore, it decides to reconnect to a
>>>> new remote from its list, searching for the new leader.  Using
>>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>>> not reset backoff even though the former leader is still likely in good
>>>> shape.
>>>>
>>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>>> snapshots.") leadership changes inside the clustered database happen
>>>> more often and therefore "leader-only" clients need to reconnect more
>>>> often too.  Not resetting the backoff every time a leadership change
>>>> happens will cause all reconnections to happen with the maximum backoff
>>>> (8 seconds) resulting in significant latency.
>>>>
>>>> This commit also updates the Python reconnect and IDL implementations
>>>> and adds tests for force-reconnect and graceful-reconnect.
>>>>
>>>> Reported-at: https://bugzilla.redhat.com/1977264
>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>> ---
>>>
>>> Hi, Dumitru.
>>>
>>
>> Hi Ilya,
>>
>>> Thanks for working on this issue.  I've seen it in practice while running
>>> OVN tests, but I still don't quite understand why it happens.  Could you,
>>> please, describe how the state transitions work here for the ovsdb-idl case?
>>>
>>
>> Without the patch, assuming a current backoff of X seconds, the sequence
>> of events (even for a connection that has seen activity after backoff)
>> is:
>>
>> - reconnect_force_reconnect() -> move to S_RECONNECT
>> - ovsdb_cs_run()
>>   -> jsonrpc_session_run()
>>      -> reconnect_run()
>>      -> reconnect_disconnected()
>>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>>            changed, stays X.
> 
> Hmm, I see.  Thanks for explanation!
> 
>>
>>>> +# Forcefully reconnect.
>>>> +force-reconnect
>>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>>> +  1 successful connections out of 3 attempts, seqno 2
>>>> +  disconnected
>>>> +run
>>>> +  should disconnect
>>>> +connecting
>>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>>
>>> Especially this part seems wrong to me.  Because after 'should disconnect'
>>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>>> literally should disconnect here, otherwise it's a violation of the reconnect
>>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>>> by not calling reconnect_disconnected() when it is required, or there is some
>>> other bug that makes the 'reconnect' module jump over a few states in the FSM.
>>>
>>> The logical workflow for the force-reconnect, from what I see in the code
>>> should be:
>>>
>>> 1. force-reconnect --> transition to S_RECONNECT
>>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>>> 5. connected ....
>>
>> This is just a bug in the test case I added, I need to issue
>> "disconnect" in the test case to trigger reconnect_disconnected()
>> to be called (like ovsdb-cs does).
>>
>>>
>>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>>> maybe also #4.
>>
>> As mentioned above, it's not the case for ovsdb-cs.
>>
>> However, while checking the test case I realized that the patch will
>> cause all "graceful" reconnects to happen with an initial backoff of 2
>> seconds for long lived sessions (instead of 1s).
>>
>> That's because:
>>
>> reconnect_graceful_reconnect()
>> -> reconnect_reset_backoff__()
>>    -> session was active after the initial 'backoff' seconds, so
>>       reset backoff to minimum.
>>    -> reconnect_transition__(..., S_RECONNECT);
>> ovsdb_cs_run()
>> -> jsonrpc_session_run()
>>    -> reconnect_run()
>>    -> reconnect_disconnected():
>>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
>>
>> I see a couple of potential ways to fix that:
>> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
>> allow a single backoff free reconnect.
>> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
>> was active recently (instead of calling reconnect_reset_backoff__()).
>>
>> However, this is all under the assumption that we want to add support
>> for two types of reconnect "semantics":
>>
>> a. forced, e.g., when the client detects inconsistencies in the data
>> received from the server (the server is in "bad shape") in which we
>> should never reset the backoff.
>>
>> b. graceful, e.g., when the client reconnects because it needs a
>> leader-only connection and a leadership transfer happened on the
>> server side (the server is likely in "good shape" just not a leader
>> anymore).
>>
>> An alternative to all this is to change reconnect_disconnected() to
>> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
>> (used to be just active and idle).
>>
>> I guess we could treat this as a bug fix and continue discussion for a
>> separate follow up patch to add the "graceful vs force" semantics if
>> they turn out to make sense for the future.
>>
>> In essence (excluding potential python and test changes) the patch would
>> become:
>>
>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>> index a929ddfd2d..3a6d93f9d1 100644
>> --- a/lib/reconnect.c
>> +++ b/lib/reconnect.c
>> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>>          if (fsm->backoff_free_tries > 1) {
>>              fsm->backoff_free_tries--;
>>              fsm->backoff = 0;
>> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
>> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>>                         || fsm->passive)) {
>>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
>> ---
>>
>> What do you think?
> 
> The problem I see with adding S_RECONNECT to this condition is that
> 'last_activity' and 'last_connected' doesn't belong to current
> connection in case we're forcefully interrupting connection in a
> S_CONNECTING state.  So, we will re-set backoff based on outdated
> information.  Also, I think, in this case, 'last_connected' might
> be larger than 'last_activity' and we will have a negative value here.

You're right, I missed this.

> All values are signed, so it should not be an issue, but it's not
> a clean solution.

It's not, I agree.

> 
> Bumping the number of free tries seems like a better solution to me,
> because:
> 
> 1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
>    is intended to avoid overloading the already overloaded server with
>    connection attempts.  In case of forceful re-connection the server
>    is not overloaded, so we should not increase a backoff.
>    IMO, backoff should only be involved in cases where we have
>    problems on the server side related to load, not the leader-only
>    stuff or even database inconsistency problems.
> 
> 2. force-reconnect should not consume free tries for exactly same
>    reasons as in point 1.  Free tries are to avoid backoff, but
>    backoff exists only to avoid storming the remote server with
>    connections while it's overloaded.
>    No overload -> No need for backoff -> No need to consume free tries.
> 
> 3. With bumping of free tries we're avoiding logical issues around
>    using 'last_activity' and 'last_connected' from the old connection.
> 
> 4. Faster reconnection, since backoff will be zero.  And that is
>    fine, because as far as we know, the server is not overloaded,
>    it's just not suitable for our needs for some reason.
>    If it is overloaded, we will backoff after the first failed
>    connection attempt.

This is even better (instead of min_backoff, I mean).

> 
> One potential problem I see with this solution is that if there
> are no servers on a list that are suitable for ovsdb-cs, we will
> have a log full of annoying reconnect logs.  Backoff kind of
> rate-limits ovsdb-cs right now and we have no internal rate-limit
> inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
> not the issue of reconnect module itself.  But we need to have a
> solution for this.
> 
> And actually, the problem here is that from the jsonrpc point of
> view the connection is perfectly fine and functional, but ovsdb-cs
> implements high-level logic on top of it and decides that
> connection is not good on a much later stage based on the actual
> received data.  So, we need to somehow propagate this information
> from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
> the state machine, otherwise we will need two separate backoffs:

That's what I was trying to implement with the "graceful reconnect":
allow ovsdb-cs to decide how to reconnect.  But there are corner cases
like the ones pointed out during the review of this patch.

> one inside the reconnect for usual connection problems and one
> in ovsdb-cs for high-level data inconsistencies and leader changes.
> 
> Thinking this way led me to a different solution.  We could expose
> something like jsonrpc_session_set_backoff_free_tries() and allow
> ovsdb-cs to make a decision.  E.g.:
> 
> diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
> index f13065c6c..900597b96 100644
> --- a/lib/ovsdb-cs.c
> +++ b/lib/ovsdb-cs.c
> @@ -729,6 +729,17 @@ void
>  ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
>  {
>      if (cs->session) {
> +        if (cs->state == CS_S_MONITORING) {
> +            /* The ovsdb-cs was in MONITORING state, so we either had data
> +             * inconsistency on this server, or it stopped being the cluster
> +             * leader, or the user requested to re-connect.  Avoiding backoff
> +             * in these cases, as we need to re-connect as soon as possible.
> +             * Connections that are not in MONITORING state should have their
> +             * backoff to avoid constant flood of re-connection attempts in
> +             * case there is no suitable database server. */
> +            jsonrpc_session_set_backoff_free_tries(
> +                cs->session, jsonrpc_session_get_n_remotes(cs->session));
> +        }
>          jsonrpc_session_force_reconnect(cs->session);
>      }
>  }
> ---
> 
> This way, if the server loses leadership or an inconsistency is detected,
> the client will have 3 free attempts to find a new suitable server.
> After that it will start to backoff as it does now.  No changes in
> reconnect module required.
> 
> Thoughts?

This works for me.  I just have a question regarding the new API: should
we allow jsonrpc users to set the free tries to any value or shall we
make it more strict, e.g., jsonrpc_session_reset_backoff_free_tries(),
which would reset the number of free tries to 'n_remotes'?

Will you be sending a patch or shall I add your "Suggested-by"?

> 
> Best regards, Ilya Maximets.
> 

Thanks,
Dumitru
Ilya Maximets July 20, 2021, 2:25 p.m. UTC | #6
On 7/20/21 3:39 PM, Dumitru Ceara wrote:
> On 7/20/21 3:18 PM, Ilya Maximets wrote:
>> On 7/13/21 5:03 PM, Dumitru Ceara wrote:
>>> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>>>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>>>> Until now clients that needed to reconnect immediately could only use
>>>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>>>> doesn't reset the backoff for connections that were alive long enough
>>>>> (more than backoff seconds).
>>>>>
>>>>> Moreover, the reconnect library cannot determine the exact reason why a
>>>>> client wishes to initiate a reconnection.  In most cases reconnection
>>>>> happens because of a fatal error when communicating with the remote,
>>>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>>>> because the remote seems to be unhealthy.
>>>>>
>>>>> There are however cases when reconnection is needed for other reasons.
>>>>> One such example is when ovsdb-clients require "leader-only" connections
>>>>> to clustered ovsdb-server databases.  Whenever the client determines
>>>>> that the remote is not a leader anymore, it decides to reconnect to a
>>>>> new remote from its list, searching for the new leader.  Using
>>>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>>>> not reset backoff even though the former leader is still likely in good
>>>>> shape.
>>>>>
>>>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>>>> snapshots.") leadership changes inside the clustered database happen
>>>>> more often and therefore "leader-only" clients need to reconnect more
>>>>> often too.  Not resetting the backoff every time a leadership change
>>>>> happens will cause all reconnections to happen with the maximum backoff
>>>>> (8 seconds) resulting in significant latency.
>>>>>
>>>>> This commit also updates the Python reconnect and IDL implementations
>>>>> and adds tests for force-reconnect and graceful-reconnect.
>>>>>
>>>>> Reported-at: https://bugzilla.redhat.com/1977264
>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>>> ---
>>>>
>>>> Hi, Dumitru.
>>>>
>>>
>>> Hi Ilya,
>>>
>>>> Thanks for working on this issue.  I've seen it in practice while running
>>>> OVN tests, but I still don't quite understand why it happens.  Could you,
>>>> please, describe how the state transitions work here for the ovsdb-idl case?
>>>>
>>>
>>> Without the patch, assuming a current backoff of X seconds, the sequence
>>> of events (even for a connection that has seen activity after backoff)
>>> is:
>>>
>>> - reconnect_force_reconnect() -> move to S_RECONNECT
>>> - ovsdb_cs_run()
>>>   -> jsonrpc_session_run()
>>>      -> reconnect_run()
>>>      -> reconnect_disconnected()
>>>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>>>            changed, stays X.
>>
>> Hmm, I see.  Thanks for explanation!
>>
>>>
>>>>> +# Forcefully reconnect.
>>>>> +force-reconnect
>>>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>>>> +  1 successful connections out of 3 attempts, seqno 2
>>>>> +  disconnected
>>>>> +run
>>>>> +  should disconnect
>>>>> +connecting
>>>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>>>
>>>> Especially this part seems wrong to me.  Because after 'should disconnect'
>>>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>>>> literally should disconnect here, otherwise it's a violation of the reconnect
>>>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>>>> by not calling reconnect_disconnected() when it is required, or there is some
>>>> other bug that makes the 'reconnect' module jump over a few states in the FSM.
>>>>
>>>> The logical workflow for the force-reconnect, from what I see in the code
>>>> should be:
>>>>
>>>> 1. force-reconnect --> transition to S_RECONNECT
>>>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>>>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>>>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>>>> 5. connected ....
>>>
>>> This is just a bug in the test case I added, I need to issue
>>> "disconnect" in the test case to trigger reconnect_disconnected()
>>> to be called (like ovsdb-cs does).
>>>
>>>>
>>>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>>>> maybe also #4.
>>>
>>> As mentioned above, it's not the case for ovsdb-cs.
>>>
>>> However, while checking the test case I realized that the patch will
>>> cause all "graceful" reconnects to happen with an initial backoff of 2
>>> seconds for long lived sessions (instead of 1s).
>>>
>>> That's because:
>>>
>>> reconnect_graceful_reconnect()
>>> -> reconnect_reset_backoff__()
>>>    -> session was active after the initial 'backoff' seconds, so
>>>       reset backoff to minimum.
>>>    -> reconnect_transition__(..., S_RECONNECT);
>>> ovsdb_cs_run()
>>> -> jsonrpc_session_run()
>>>    -> reconnect_run()
>>>    -> reconnect_disconnected():
>>>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
>>>
>>> I see a couple of potential ways to fix that:
>>> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
>>> allow a single backoff free reconnect.
>>> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
>>> was active recently (instead of calling reconnect_reset_backoff__()).
>>>
>>> However, this is all under the assumption that we want to add support
>>> for two types of reconnect "semantics":
>>>
>>> a. forced, e.g., when the client detects inconsistencies in the data
>>> received from the server (the server is in "bad shape") in which we
>>> should never reset the backoff.
>>>
>>> b. graceful, e.g., when the client reconnects because it needs a
>>> leader-only connection and a leadership transfer happened on the
>>> server side (the server is likely in "good shape" just not a leader
>>> anymore).
>>>
>>> An alternative to all this is to change reconnect_disconnected() to
>>> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
>>> (used to be just active and idle).
>>>
>>> I guess we could treat this as a bug fix and continue discussion for a
>>> separate follow up patch to add the "graceful vs force" semantics if
>>> they turn out to make sense for the future.
>>>
>>> In essence (excluding potential python and test changes) the patch would
>>> become:
>>>
>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>> index a929ddfd2d..3a6d93f9d1 100644
>>> --- a/lib/reconnect.c
>>> +++ b/lib/reconnect.c
>>> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>>>          if (fsm->backoff_free_tries > 1) {
>>>              fsm->backoff_free_tries--;
>>>              fsm->backoff = 0;
>>> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
>>> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>>>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>>>                         || fsm->passive)) {
>>>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
>>> ---
>>>
>>> What do you think?
>>
>> The problem I see with adding S_RECONNECT to this condition is that
>> 'last_activity' and 'last_connected' doesn't belong to current
>> connection in case we're forcefully interrupting connection in a
>> S_CONNECTING state.  So, we will re-set backoff based on outdated
>> information.  Also, I think, in this case, 'last_connected' might
>> be larger than 'last_activity' and we will have a negative value here.
> 
> You're right, I missed this.
> 
>> All values are signed, so it should not be an issue, but it's not
>> a clean solution.
> 
> It's not, I agree.
> 
>>
>> Bumping the number of free tries seems like a better solution to me,
>> because:
>>
>> 1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
>>    is intended to avoid overloading the already overloaded server with
>>    connection attempts.  In case of forceful re-connection the server
>>    is not overloaded, so we should not increase a backoff.
>>    IMO, backoff should only be involved in cases where we have
>>    problems on the server side related to load, not the leader-only
>>    stuff or even database inconsistency problems.
>>
>> 2. force-reconnect should not consume free tries for exactly same
>>    reasons as in point 1.  Free tries are to avoid backoff, but
>>    backoff exists only to avoid storming the remote server with
>>    connections while it's overloaded.
>>    No overload -> No need for backoff -> No need to consume free tries.
>>
>> 3. With bumping of free tries we're avoiding logical issues around
>>    using 'last_activity' and 'last_connected' from the old connection.
>>
>> 4. Faster reconnection, since backoff will be zero.  And that is
>>    fine, because as far as we know, the server is not overloaded,
>>    it's just not suitable for our needs for some reason.
>>    If it is overloaded, we will backoff after the first failed
>>    connection attempt.
> 
> This is even better (instead of min_backoff, I mean).
> 
>>
>> One potential problem I see with this solution is that if there
>> are no servers on a list that are suitable for ovsdb-cs, we will
>> have a log full of annoying reconnect logs.  Backoff kind of
>> rate-limits ovsdb-cs right now and we have no internal rate-limit
>> inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
>> not the issue of reconnect module itself.  But we need to have a
>> solution for this.
>>
>> And actually, the problem here is that from the jsonrpc point of
>> view the connection is perfectly fine and functional, but ovsdb-cs
>> implements high-level logic on top of it and decides that
>> connection is not good on a much later stage based on the actual
>> received data.  So, we need to somehow propagate this information
>> from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
>> the state machine, otherwise we will need two separate backoffs:
> 
> That's what I was trying to implement with the "graceful reconnect:
> allow ovsdb-cs to decide how to reconnect.  But there are corner cases
> like the ones pointed out during the review of this patch.
> 
>> one inside the reconnect for usual connection problems and one
>> in ovsdb-cs for high-level data inconsistencies and leader changes.
>>
>> Thinking this way led me to a different solution.  We could expose
>> something like jsonrpc_session_set_backoff_free_tries() and allow
>> ovsdb-cs to make a decision.  E.g.:
>>
>> diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
>> index f13065c6c..900597b96 100644
>> --- a/lib/ovsdb-cs.c
>> +++ b/lib/ovsdb-cs.c
>> @@ -729,6 +729,17 @@ void
>>  ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
>>  {
>>      if (cs->session) {
>> +        if (cs->state == CS_S_MONITORING) {
>> +            /* The ovsdb-cs was in MONITORING state, so we either had data
>> +             * inconsistency on this server, or it stopped being the cluster
>> +             * leader, or the user requested to re-connect.  Avoiding backoff
>> +             * in these cases, as we need to re-connect as soon as possible.
>> +             * Connections that are not in MONITORING state should have their
>> +             * backoff to avoid constant flood of re-connection attempts in
>> +             * case there is no suitable database server. */
>> +            jsonrpc_session_set_backoff_free_tries(
>> +                cs->session, jsonrpc_session_get_n_remotes(cs->session));
>> +        }
>>          jsonrpc_session_force_reconnect(cs->session);
>>      }
>>  }
>> ---
>>
>> This way, if the server loses leadership or an inconsistency is detected,
>> the client will have 3 free attempts to find a new suitable server.
>> After that it will start to backoff as it does now.  No changes in
>> reconnect module required.
>>
>> Thoughts?
> 
> This works for me.  I just have a question regarding the new API: should
> we allow jsonrpc users to set the free tries to any value or shall we
> make it more strict, e.g., jsonrpc_session_reset_backoff_free_tries(),
> which would reset the number of free tries to 'n_remotes'?

jsonrpc_session_reset_backoff_free_tries() seems better, because ovsdb-cs
doesn't actually set the number of free tries for jsonrpc from the start.
Maybe we can name it just jsonrpc_session_reset_backoff()?  I'm not
sure, but I just don't like this very long name.
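
A possible shape for it (a sketch only; it assumes the session keeps its
remotes in an svec 's->remotes' and its FSM in 's->reconnect', and grants
one extra try so the forced reconnect of the current connection does not
consume it):

void
jsonrpc_session_reset_backoff(struct jsonrpc_session *s)
{
    unsigned int free_tries = s->remotes.n;

    if (jsonrpc_session_is_connected(s)) {
        /* The extra try will be consumed by the forced reconnection of
         * the currently active connection. */
        free_tries++;
    }
    reconnect_set_backoff_free_tries(s->reconnect, free_tries);
}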

> 
> Will you be sending a patch or shall I add your "Suggested-by"?

I'm OK with "Suggested-by".
It would also be great to have some kind of test for this functionality.

> 
>>
>> Best regards, Ilya Maximets.
>>
> 
> Thanks,
> Dumitru
>
Ilya Maximets July 20, 2021, 2:29 p.m. UTC | #7
On 7/20/21 4:25 PM, Ilya Maximets wrote:
> On 7/20/21 3:39 PM, Dumitru Ceara wrote:
>> On 7/20/21 3:18 PM, Ilya Maximets wrote:
>>> On 7/13/21 5:03 PM, Dumitru Ceara wrote:
>>>> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>>>>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>>>>> Until now clients that needed to reconnect immediately could only use
>>>>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>>>>> doesn't reset the backoff for connections that were alive long enough
>>>>>> (more than backoff seconds).
>>>>>>
>>>>>> Moreover, the reconnect library cannot determine the exact reason why a
>>>>>> client wishes to initiate a reconnection.  In most cases reconnection
>>>>>> happens because of a fatal error when communicating with the remote,
>>>>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>>>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>>>>> because the remote seems to be unhealthy.
>>>>>>
>>>>>> There are however cases when reconnection is needed for other reasons.
>>>>>> One such example is when ovsdb-clients require "leader-only" connections
>>>>>> to clustered ovsdb-server databases.  Whenever the client determines
>>>>>> that the remote is not a leader anymore, it decides to reconnect to a
>>>>>> new remote from its list, searching for the new leader.  Using
>>>>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>>>>> not reset backoff even though the former leader is still likely in good
>>>>>> shape.
>>>>>>
>>>>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>>>>> snapshots.") leadership changes inside the clustered database happen
>>>>>> more often and therefore "leader-only" clients need to reconnect more
>>>>>> often too.  Not resetting the backoff every time a leadership change
>>>>>> happens will cause all reconnections to happen with the maximum backoff
>>>>>> (8 seconds) resulting in significant latency.
>>>>>>
>>>>>> This commit also updates the Python reconnect and IDL implementations
>>>>>> and adds tests for force-reconnect and graceful-reconnect.
>>>>>>
>>>>>> Reported-at: https://bugzilla.redhat.com/1977264
>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>>>> ---
>>>>>
>>>>> Hi, Dumitru.
>>>>>
>>>>
>>>> Hi Ilya,
>>>>
>>>>> Thanks for working on this issue.  I've seen it in practice while running
>>>>> OVN tests, but I still don't quite understand why it happens.  Could you,
>>>>> please, describe how the state transitions work here for the ovsdb-idl case?
>>>>>
>>>>
>>>> Without the patch, assuming a current backoff of X seconds, the sequence
>>>> of events (even for a connection that has seen activity after backoff)
>>>> is:
>>>>
>>>> - reconnect_force_reconnect() -> move to S_RECONNECT
>>>> - ovsdb_cs_run()
>>>>   -> jsonrpc_session_run()
>>>>      -> reconnect_run()
>>>>      -> reconnect_disconnected()
>>>>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>>>>            changed, stays X.
>>>
>>> Hmm, I see.  Thanks for explanation!
>>>
>>>>
>>>>>> +# Forcefully reconnect.
>>>>>> +force-reconnect
>>>>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>>>>> +  1 successful connections out of 3 attempts, seqno 2
>>>>>> +  disconnected
>>>>>> +run
>>>>>> +  should disconnect
>>>>>> +connecting
>>>>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>>>>
>>>>> Especially this part seems wrong to me.  Because after 'should disconnect'
>>>>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>>>>> literally should disconnect here, otherwise it's a violation of the reconnect
>>>>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>>>>> by not calling reconnect_disconnected() when it is required, or there is some
>>>>> other bug that makes the 'reconnect' module jump over a few states in the FSM.
>>>>>
>>>>> The logical workflow for the force-reconnect, from what I see in the code
>>>>> should be:
>>>>>
>>>>> 1. force-reconnect --> transition to S_RECONNECT
>>>>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>>>>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>>>>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>>>>> 5. connected ....
>>>>
>>>> This is just a bug in the test case I added, I need to issue
>>>> "disconnect" in the test case to trigger reconnect_disconnected()
>>>> to be called (like ovsdb-cs does).
>>>>
>>>>>
>>>>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>>>>> maybe also #4.
>>>>
>>>> As mentioned above, it's not the case for ovsdb-cs.
>>>>
>>>> However, while checking the test case I realized that the patch will
>>>> cause all "graceful" reconnects to happen with an initial backoff of 2
>>>> seconds for long lived sessions (instead of 1s).
>>>>
>>>> That's because:
>>>>
>>>> reconnect_graceful_reconnect()
>>>> -> reconnect_reset_backoff__()
>>>>    -> session was active after the initial 'backoff' seconds, so
>>>>       reset backoff to minimum.
>>>>    -> reconnect_transition__(..., S_RECONNECT);
>>>> ovsdb_cs_run()
>>>> -> jsonrpc_session_run()
>>>>    -> reconnect_run()
>>>>    -> reconnect_disconnected():
>>>>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
>>>>
>>>> I see a couple of potential ways to fix that:
>>>> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
>>>> allow a single backoff free reconnect.
>>>> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
>>>> was active recently (instead of calling reconnect_reset_backoff__()).
>>>>
>>>> However, this is all under the assumption that we want to add support
>>>> for two types of reconnect "semantics":
>>>>
>>>> a. forced, e.g., when the client detects inconsistencies in the data
>>>> received from the server (the server is in "bad shape") in which we
>>>> should never reset the backoff.
>>>>
>>>> b. graceful, e.g., when the client reconnects because it needs a
>>>> leader-only connection and a leadership transfer happened on the
>>>> server side (the server is likely in "good shape" just not a leader
>>>> anymore).
>>>>
>>>> An alternative to all this is to change reconnect_disconnected() to
>>>> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
>>>> (used to be just active and idle).
>>>>
>>>> I guess we could treat this as a bug fix and continue discussion for a
>>>> separate follow up patch to add the "graceful vs force" semantics if
>>>> they turn out to make sense for the future.
>>>>
>>>> In essence (excluding potential python and test changes) the patch would
>>>> become:
>>>>
>>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>>> index a929ddfd2d..3a6d93f9d1 100644
>>>> --- a/lib/reconnect.c
>>>> +++ b/lib/reconnect.c
>>>> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>>>>          if (fsm->backoff_free_tries > 1) {
>>>>              fsm->backoff_free_tries--;
>>>>              fsm->backoff = 0;
>>>> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
>>>> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>>>>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>>>>                         || fsm->passive)) {
>>>>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
>>>> ---
>>>>
>>>> What do you think?
>>>
>>> The problem I see with adding S_RECONNECT to this condition is that
>>> 'last_activity' and 'last_connected' don't belong to the current
>>> connection in case we're forcefully interrupting a connection in the
>>> S_CONNECTING state.  So, we will re-set backoff based on outdated
>>> information.  Also, I think, in this case, 'last_connected' might
>>> be larger than 'last_activity' and we will have a negative value here.
>>
>> You're right, I missed this.
>>
>>> All values are signed, so it should not be an issue, but it's not
>>> a clean solution.
>>
>> It's not, I agree.
>>
>>>
>>> Bumping the number of free tries seems like a better solution to me,
>>> because:
>>>
>>> 1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
>>>    is intended to avoid overloading the already overloaded server with
>>>    connection attempts.  In case of forceful re-connection the server
>>>    is not overloaded, so we should not increase a backoff.
>>>    IMO, backoff should only be involved in cases where we have
>>>    problems on the server side related to load, not the leader-only
>>>    stuff or even database inconsistency problems.
>>>
>>> 2. force-reconnect should not consume free tries for exactly the same
>>>    reasons as in point 1.  Free tries are to avoid backoff, but
>>>    backoff exists only to avoid storming the remote server with
>>>    connections while it's overloaded.
>>>    No overload -> No need for backoff -> No need to consume free tries.
>>>
>>> 3. With bumping of free tries we're avoiding logical issues around
>>>    using 'last_activity' and 'last_connected' from the old connection.
>>>
>>> 4. Faster reconnection, since backoff will be zero.  And that is
>>>    fine, because as far as we know, the server is not overloaded,
>>>    it's just not suitable for our needs for some reason.
>>>    If it is overloaded, we will backoff after the first failed
>>>    connection attempt.
>>
>> This is even better (instead of min_backoff, I mean).
>>
>>>
>>> One potential problem I see with this solution is that if there
>>> are no servers on a list that are suitable for ovsdb-cs, we will
>>> have a log full of annoying reconnect logs.  Backoff kind of
>>> rate-limits ovsdb-cs right now and we have no internal rate-limit
>>> inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
>>> not the issue of reconnect module itself.  But we need to have a
>>> solution for this.
>>>
>>> And actually, the problem here is that from the jsonrpc point of
>>> view the connection is perfectly fine and functional, but ovsdb-cs
>>> implements high-level logic on top of it and decides that the
>>> connection is not good at a much later stage based on the actual
>>> received data.  So, we need to somehow propagate this information
>>> from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
>>> the state machine, otherwise we will need two separate backoffs:
>>
>> That's what I was trying to implement with the "graceful reconnect":
>> allow ovsdb-cs to decide how to reconnect.  But there are corner cases
>> like the ones pointed out during the review of this patch.
>>
>>> one inside the reconnect for usual connection problems and one
>>> in ovsdb-cs for high-level data inconsistencies and leader changes.
>>>
>>> Thinking this way led me to a different solution.  We could expose
>>> something like jsonrpc_session_set_backoff_free_tries() and allow
>>> ovsdb-cs to make a decision.  E.g.:
>>>
>>> diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
>>> index f13065c6c..900597b96 100644
>>> --- a/lib/ovsdb-cs.c
>>> +++ b/lib/ovsdb-cs.c
>>> @@ -729,6 +729,17 @@ void
>>>  ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
>>>  {
>>>      if (cs->session) {
>>> +        if (cs->state == CS_S_MONITORING) {
>>> +            /* The ovsdb-cs was in MONITORING state, so we either had data
>>> +             * inconsistency on this server, or it stopped being the cluster
>>> +             * leader, or the user requested to re-connect.  Avoiding backoff
>>> +             * in these cases, as we need to re-connect as soon as possible.
>>> +             * Connections that are not in MONITORING state should have their
>>> +             * backoff to avoid constant flood of re-connection attempts in
>>> +             * case there is no suitable database server. */
>>> +            jsonrpc_session_set_backoff_free_tries(
>>> +                cs->session, jsonrpc_session_get_n_remotes(cs->session));
>>> +        }
>>>          jsonrpc_session_force_reconnect(cs->session);
>>>      }
>>>  }
>>> ---
>>>
>>> This way, if the server loses leadership or an inconsistency is detected,
>>> the client will have 3 free attempts to find a new suitable server.
>>> After that it will start to backoff as it does now.  No changes in
>>> reconnect module required.
>>>
>>> Thoughts?
>>
>> This works for me.  I just have a question regarding the new API: should
>> we allow jsonrpc users to set the free tries to any value or shall we
>> make it more strict, e.g., jsonrpc_session_reset_backoff_free_tries(),
>> which would reset the number of free tries to 'n_remotes'?

And it should probably set the number of free tries to n_remotes + 1,
because it will count the current disconnection as an attempt.
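
For illustration, a minimal sketch of the helper being discussed could look
like the code below.  This is hypothetical and not part of this patch: the
name follows the jsonrpc_session_reset_backoff() suggestion, it reuses
jsonrpc_session_get_n_remotes() from the ovsdb-cs snippet above, and it
assumes a reconnect_set_backoff_free_tries() setter (or an equivalent way to
bump the FSM's backoff_free_tries):

/* Grant the session one backoff-free reconnection attempt per configured
 * remote, plus one more because the disconnection that follows the reset
 * will itself consume a free try. */
void
jsonrpc_session_reset_backoff(struct jsonrpc_session *s)
{
    unsigned int free_tries = jsonrpc_session_get_n_remotes(s) + 1;

    reconnect_set_backoff_free_tries(s->reconnect, free_tries);
}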

> 
> jsonrpc_session_reset_backoff_free_tries() seems better, because ovsdb-cs
> doesn't actually set the number of free tries for jsonrpc from the start.
> Maybe we can name it just jsonrpc_session_reset_backoff()?  I'm not
> sure, but I just don't like this very long name.
> 
>>
>> Will you be sending a patch or shall I add your "Suggested-by"?
> 
> I'm OK with "Suggested-by".
> It would also be great to have some kind of test for this functionality.
> 
>>
>>>
>>> Best regards, Ilya Maximets.
>>>
>>
>> Thanks,
>> Dumitru
>>
>
Dumitru Ceara July 20, 2021, 2:32 p.m. UTC | #8
On 7/20/21 4:29 PM, Ilya Maximets wrote:
> On 7/20/21 4:25 PM, Ilya Maximets wrote:
>> On 7/20/21 3:39 PM, Dumitru Ceara wrote:
>>> On 7/20/21 3:18 PM, Ilya Maximets wrote:
>>>> On 7/13/21 5:03 PM, Dumitru Ceara wrote:
>>>>> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>>>>>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>>>>>> Until now clients that needed to reconnect immediately could only use
>>>>>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>>>>>> doesn't reset the backoff for connections that were alive long enough
>>>>>>> (more than backoff seconds).
>>>>>>>
>>>>>>> Moreover, the reconnect library cannot determine the exact reason why a
>>>>>>> client wishes to initiate a reconnection.  In most cases reconnection
>>>>>>> happens because of a fatal error when communicating with the remote,
>>>>>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>>>>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>>>>>> because the remote seems to be unhealthy.
>>>>>>>
>>>>>>> There are however cases when reconnection is needed for other reasons.
>>>>>>> One such example is when ovsdb-clients require "leader-only" connections
>>>>>>> to clustered ovsdb-server databases.  Whenever the client determines
>>>>>>> that the remote is not a leader anymore, it decides to reconnect to a
>>>>>>> new remote from its list, searching for the new leader.  Using
>>>>>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>>>>>> not reset backoff even though the former leader is still likely in good
>>>>>>> shape.
>>>>>>>
>>>>>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>>>>>> snapshots.") leadership changes inside the clustered database happen
>>>>>>> more often and therefore "leader-only" clients need to reconnect more
>>>>>>> often too.  Not resetting the backoff every time a leadership change
>>>>>>> happens will cause all reconnections to happen with the maximum backoff
>>>>>>> (8 seconds) resulting in significant latency.
>>>>>>>
>>>>>>> This commit also updates the Python reconnect and IDL implementations
>>>>>>> and adds tests for force-reconnect and graceful-reconnect.
>>>>>>>
>>>>>>> Reported-at: https://bugzilla.redhat.com/1977264
>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>>>>> ---
>>>>>>
>>>>>> Hi, Dumitru.
>>>>>>
>>>>>
>>>>> Hi Ilya,
>>>>>
>>>>>> Thanks for working on this issue.  I've seen it in practice while running
>>>>>> OVN tests, but I still don't quite understand why it happens.  Could you,
>>>>>> please, describe how state transitioning works here for the ovsdb-idl case?
>>>>>>
>>>>>
>>>>> Without the patch, assuming a current backoff of X seconds, the sequence
>>>>> of events (even for a connection that has seen activity after backoff)
>>>>> is:
>>>>>
>>>>> - reconnect_force_reconnect() -> move to S_RECONNECT
>>>>> - ovsdb_cs_run()
>>>>>   -> jsonrpc_session_run()
>>>>>      -> reconnect_run()
>>>>>      -> reconnect_disconnected()
>>>>>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>>>>>            changed, stays X.
>>>>
>>>> Hmm, I see.  Thanks for explanation!
>>>>
>>>>>
>>>>>>> +# Forcefully reconnect.
>>>>>>> +force-reconnect
>>>>>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>>>>>> +  1 successful connections out of 3 attempts, seqno 2
>>>>>>> +  disconnected
>>>>>>> +run
>>>>>>> +  should disconnect
>>>>>>> +connecting
>>>>>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>>>>>
>>>>>> Especially this part seems wrong to me.  Because after 'should disconnect'
>>>>>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>>>>>> literally should disconnect here, otherwise it's a violation of the reconnect
>>>>>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>>>>>> by not calling reconnect_disconnected() when it is required, or there is some
>>>>>> other bug that makes the 'reconnect' module jump over a few states in the fsm.
>>>>>>
>>>>>> The logical workflow for the force-reconnect, from what I see in the code
>>>>>> should be:
>>>>>>
>>>>>> 1. force-reconnect --> transition to S_RECONNECT
>>>>>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>>>>>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>>>>>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>>>>>> 5. connected ....
>>>>>
>>>>> This is just a bug in the test case I added, I need to issue
>>>>> "disconnect" in the test case to trigger reconnect_disconnected()
>>>>> to be called (like ovsdb-cs does).
>>>>>
>>>>>>
>>>>>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>>>>>> maybe also #4.
>>>>>
>>>>> As mentioned above, it's not the case for ovsdb-cs.
>>>>>
>>>>> However, while checking the test case I realized that the patch will
>>>>> cause all "graceful" reconnects to happen with an initial backoff of 2
>>>>> seconds for long lived sessions (instead of 1s).
>>>>>
>>>>> That's because:
>>>>>
>>>>> reconnect_graceful_reconnect()
>>>>> -> reconnect_reset_backoff__()
>>>>>    -> session was active after the initial 'backoff' seconds, so
>>>>>       reset backoff to minimum.
>>>>>    -> reconnect_transition__(..., S_RECONNECT);
>>>>> ovsdb_cs_run()
>>>>> -> jsonrpc_session_run()
>>>>>    -> reconnect_run()
>>>>>    -> reconnect_disconnected():
>>>>>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
>>>>>
>>>>> I see a couple of potential ways to fix that:
>>>>> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
>>>>> allow a single backoff free reconnect.
>>>>> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
>>>>> was active recently (instead of calling reconnect_reset_backoff__()).
>>>>>
>>>>> However, this is all under the assumption that we want to add support
>>>>> for two types of reconnect "semantics":
>>>>>
>>>>> a. forced, e.g., when the client detects inconsistencies in the data
>>>>> received from the server (the server is in "bad shape"), in which case
>>>>> we should never reset the backoff.
>>>>>
>>>>> b. graceful, e.g., when the client reconnects because it needs a
>>>>> leader-only connection and a leadership transfer happened on the
>>>>> server side (the server is likely in "good shape" just not a leader
>>>>> anymore).
>>>>>
>>>>> An alternative to all this is to change reconnect_disconnected() to
>>>>> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
>>>>> (used to be just active and idle).
>>>>>
>>>>> I guess we could treat this as a bug fix and continue discussion for a
>>>>> separate follow up patch to add the "graceful vs force" semantics if
>>>>> they turn out to make sense for the future.
>>>>>
>>>>> In essence (excluding potential python and test changes) the patch would
>>>>> become:
>>>>>
>>>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>>>> index a929ddfd2d..3a6d93f9d1 100644
>>>>> --- a/lib/reconnect.c
>>>>> +++ b/lib/reconnect.c
>>>>> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>>>>>          if (fsm->backoff_free_tries > 1) {
>>>>>              fsm->backoff_free_tries--;
>>>>>              fsm->backoff = 0;
>>>>> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
>>>>> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>>>>>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>>>>>                         || fsm->passive)) {
>>>>>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
>>>>> ---
>>>>>
>>>>> What do you think?
>>>>
>>>> The problem I see with adding S_RECONNECT to this condition is that
>>>> 'last_activity' and 'last_connected' don't belong to the current
>>>> connection in case we're forcefully interrupting a connection in the
>>>> S_CONNECTING state.  So, we will re-set backoff based on outdated
>>>> information.  Also, I think, in this case, 'last_connected' might
>>>> be larger than 'last_activity' and we will have a negative value here.
>>>
>>> You're right, I missed this.
>>>
>>>> All values are signed, so it should not be an issue, but it's not
>>>> a clean solution.
>>>
>>> It's not, I agree.
>>>
>>>>
>>>> Bumping the number of free tries seems like a better solution to me,
>>>> because:
>>>>
>>>> 1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
>>>>    is intended to avoid overloading the already overloaded server with
>>>>    connection attempts.  In case of forceful re-connection the server
>>>>    is not overloaded, so we should not increase a backoff.
>>>>    IMO, backoff should only be involved in cases where we have
>>>>    problems on the server side related to load, not the leader-only
>>>>    stuff or even database inconsistency problems.
>>>>
>>>> 2. force-reconnect should not consume free tries for exactly the same
>>>>    reasons as in point 1.  Free tries are to avoid backoff, but
>>>>    backoff exists only to avoid storming the remote server with
>>>>    connections while it's overloaded.
>>>>    No overload -> No need for backoff -> No need to consume free tries.
>>>>
>>>> 3. With bumping of free tries we're avoiding logical issues around
>>>>    using 'last_activity' and 'last_connected' from the old connection.
>>>>
>>>> 4. Faster reconnection, since backoff will be zero.  And that is
>>>>    fine, because as far as we know, the server is not overloaded,
>>>>    it's just not suitable for our needs for some reason.
>>>>    If it is overloaded, we will backoff after the first failed
>>>>    connection attempt.
>>>
>>> This is even better (instead of min_backoff, I mean).
>>>
>>>>
>>>> One potential problem I see with this solution is that if there
>>>> are no servers on a list that are suitable for ovsdb-cs, we will
>>>> have a log full of annoying reconnect logs.  Backoff kind of
>>>> rate-limits ovsdb-cs right now and we have no internal rate-limit
>>>> inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
>>>> not the issue of reconnect module itself.  But we need to have a
>>>> solution for this.
>>>>
>>>> And actually, the problem here is that from the jsonrpc point of
>>>> view the connection is perfectly fine and functional, but ovsdb-cs
>>>> implements high-level logic on top of it and decides that the
>>>> connection is not good at a much later stage based on the actual
>>>> received data.  So, we need to somehow propagate this information
>>>> from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
>>>> the state machine, otherwise we will need two separate backoffs:
>>>
>>> That's what I was trying to implement with the "graceful reconnect":
>>> allow ovsdb-cs to decide how to reconnect.  But there are corner cases
>>> like the ones pointed out during the review of this patch.
>>>
>>>> one inside the reconnect for usual connection problems and one
>>>> in ovsdb-cs for high-level data inconsistencies and leader changes.
>>>>
>>>> Thinking this way led me to a different solution.  We could expose
>>>> something like jsonrpc_session_set_backoff_free_tries() and allow
>>>> ovsdb-cs to make a decision.  E.g.:
>>>>
>>>> diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
>>>> index f13065c6c..900597b96 100644
>>>> --- a/lib/ovsdb-cs.c
>>>> +++ b/lib/ovsdb-cs.c
>>>> @@ -729,6 +729,17 @@ void
>>>>  ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
>>>>  {
>>>>      if (cs->session) {
>>>> +        if (cs->state == CS_S_MONITORING) {
>>>> +            /* The ovsdb-cs was in MONITORING state, so we either had data
>>>> +             * inconsistency on this server, or it stopped being the cluster
>>>> +             * leader, or the user requested to re-connect.  Avoiding backoff
>>>> +             * in these cases, as we need to re-connect as soon as possible.
>>>> +             * Connections that are not in MONITORING state should have their
>>>> +             * backoff to avoid constant flood of re-connection attempts in
>>>> +             * case there is no suitable database server. */
>>>> +            jsonrpc_session_set_backoff_free_tries(
>>>> +                cs->session, jsonrpc_session_get_n_remotes(cs->session));
>>>> +        }
>>>>          jsonrpc_session_force_reconnect(cs->session);
>>>>      }
>>>>  }
>>>> ---
>>>>
>>>> This way, if the server loses leadership or an inconsistency is detected,
>>>> the client will have 3 free attempts to find a new suitable server.
>>>> After that it will start to backoff as it does now.  No changes in
>>>> reconnect module required.
>>>>
>>>> Thoughts?
>>>
>>> This works for me.  I just have a question regarding the new API: should
>>> we allow jsonrpc users to set the free tries to any value or shall we
>>> make it more strict, e.g., jsonrpc_session_reset_backoff_free_tries(),
>>> which would reset the number of free tries to 'n_remotes'?
> 
> And it should probably set the number of free tries to n_remotes + 1,
> because it will count the current disconnection as an attempt.
> 

Makes sense.

>>
>> jsonrpc_session_reset_backoff_free_tries() seems better, because ovsdb-cs
>> doesn't actually set the number of free tries for jsonrpc from the start.
>> Maybe we can name it just jsonrpc_session_reset_backoff()?  I'm not
>> sure, but I just don't like this very long name.
>>
>>>
>>> Will you be sending a patch or shall I add your "Suggested-by"?
>>
>> I'm OK with "Suggested-by".
>> It would also be great to have some kind of test for this functionality.
>>

Let me try to come up with something.  I'll post a v2 these days.

Regards,
Dumitru
Dumitru Ceara July 21, 2021, 10:58 a.m. UTC | #9
On 7/20/21 4:32 PM, Dumitru Ceara wrote:
> On 7/20/21 4:29 PM, Ilya Maximets wrote:
>> On 7/20/21 4:25 PM, Ilya Maximets wrote:
>>> On 7/20/21 3:39 PM, Dumitru Ceara wrote:
>>>> On 7/20/21 3:18 PM, Ilya Maximets wrote:
>>>>> On 7/13/21 5:03 PM, Dumitru Ceara wrote:
>>>>>> On 7/12/21 10:36 PM, Ilya Maximets wrote:
>>>>>>> On 6/29/21 1:20 PM, Dumitru Ceara wrote:
>>>>>>>> Until now clients that needed to reconnect immediately could only use
>>>>>>>> reconnect_force_reconnect().  However, reconnect_force_reconnect()
>>>>>>>> doesn't reset the backoff for connections that were alive long enough
>>>>>>>> (more than backoff seconds).
>>>>>>>>
>>>>>>>> Moreover, the reconnect library cannot determine the exact reason why a
>>>>>>>> client wishes to initiate a reconnection.  In most cases reconnection
>>>>>>>> happens because of a fatal error when communicating with the remote,
>>>>>>>> e.g., in the ovsdb-cs layer, when invalid messages are received from
>>>>>>>> ovsdb-server.  In such cases it makes sense to not reset the backoff
>>>>>>>> because the remote seems to be unhealthy.
>>>>>>>>
>>>>>>>> There are however cases when reconnection is needed for other reasons.
>>>>>>>> One such example is when ovsdb-clients require "leader-only" connections
>>>>>>>> to clustered ovsdb-server databases.  Whenever the client determines
>>>>>>>> that the remote is not a leader anymore, it decides to reconnect to a
>>>>>>>> new remote from its list, searching for the new leader.  Using
>>>>>>>> jsonrpc_force_reconnect() (which calls reconnect_force_reconnect()) will
>>>>>>>> not reset backoff even though the former leader is still likely in good
>>>>>>>> shape.
>>>>>>>>
>>>>>>>> Since 3c2d6274bcee ("raft: Transfer leadership before creating
>>>>>>>> snapshots.") leadership changes inside the clustered database happen
>>>>>>>> more often and therefore "leader-only" clients need to reconnect more
>>>>>>>> often too.  Not resetting the backoff every time a leadership change
>>>>>>>> happens will cause all reconnections to happen with the maximum backoff
>>>>>>>> (8 seconds) resulting in significant latency.
>>>>>>>>
>>>>>>>> This commit also updates the Python reconnect and IDL implementations
>>>>>>>> and adds tests for force-reconnect and graceful-reconnect.
>>>>>>>>
>>>>>>>> Reported-at: https://bugzilla.redhat.com/1977264
>>>>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>>>>>>> ---
>>>>>>>
>>>>>>> Hi, Dumitru.
>>>>>>>
>>>>>>
>>>>>> Hi Ilya,
>>>>>>
>>>>>>> Thanks for working on this issue.  I've seen it in practice while running
>>>>>>> OVN tests, but I still don't quite understand why it happens.  Could you,
>>>>>>> please, describe how state transitioning works here for the ovsdb-idl case?
>>>>>>>
>>>>>>
>>>>>> Without the patch, assuming a current backoff of X seconds, the sequence
>>>>>> of events (even for a connection that has seen activity after backoff)
>>>>>> is:
>>>>>>
>>>>>> - reconnect_force_reconnect() -> move to S_RECONNECT
>>>>>> - ovsdb_cs_run()
>>>>>>   -> jsonrpc_session_run()
>>>>>>      -> reconnect_run()
>>>>>>      -> reconnect_disconnected()
>>>>>>         -> because state is not S_ACTIVE | S_IDLE backoff is not
>>>>>>            changed, stays X.
>>>>>
>>>>> Hmm, I see.  Thanks for explanation!
>>>>>
>>>>>>
>>>>>>>> +# Forcefully reconnect.
>>>>>>>> +force-reconnect
>>>>>>>> +  in RECONNECT for 0 ms (2000 ms backoff)
>>>>>>>> +  1 successful connections out of 3 attempts, seqno 2
>>>>>>>> +  disconnected
>>>>>>>> +run
>>>>>>>> +  should disconnect
>>>>>>>> +connecting
>>>>>>>> +  in CONNECTING for 0 ms (2000 ms backoff)
>>>>>>>
>>>>>>> Especially this part seems wrong to me.  Because after 'should disconnect'
>>>>>>> there should be 'disconnect' or 'connect-fail', but not 'connecting'.  We
>>>>>>> literally should disconnect here, otherwise it's a violation of the reconnect
>>>>>>> API.  And my concern is that ovsdb-cs or jsonrpc violates the API somewhere
>>>>>>> by not calling reconnect_disconnected() when it is required, or there is some
>>>>>>> other bug that makes the 'reconnect' module jump over a few states in the fsm.
>>>>>>>
>>>>>>> The logical workflow for the force-reconnect, from what I see in the code
>>>>>>> should be:
>>>>>>>
>>>>>>> 1. force-reconnect --> transition to S_RECONNECT
>>>>>>> 2. run -> in S_RECONNECT, so returning RECONNECT_DISCONNECT
>>>>>>> 3. disconnect -> check the state, update backoff and transition to S_BACKOFF
>>>>>>> 4. run -> in S_BACKOFF, so returning RECONNECT_CONNECT
>>>>>>> 5. connected ....
>>>>>>
>>>>>> This is just a bug in the test case I added, I need to issue
>>>>>> "disconnect" in the test case to trigger reconnect_disconnected()
>>>>>> to be called (like ovsdb-cs does).
>>>>>>
>>>>>>>
>>>>>>> Something is fishy here, because ovsdb-cs somehow jumps over step #3 and
>>>>>>> maybe also #4.
>>>>>>
>>>>>> As mentioned above, it's not the case for ovsdb-cs.
>>>>>>
>>>>>> However, while checking the test case I realized that the patch will
>>>>>> cause all "graceful" reconnects to happen with an initial backoff of 2
>>>>>> seconds for long lived sessions (instead of 1s).
>>>>>>
>>>>>> That's because:
>>>>>>
>>>>>> reconnect_graceful_reconnect()
>>>>>> -> reconnect_reset_backoff__()
>>>>>>    -> session was active after the initial 'backoff' seconds, so
>>>>>>       reset backoff to minimum.
>>>>>>    -> reconnect_transition__(..., S_RECONNECT);
>>>>>> ovsdb_cs_run()
>>>>>> -> jsonrpc_session_run()
>>>>>>    -> reconnect_run()
>>>>>>    -> reconnect_disconnected():
>>>>>>       -> if (!reconnect_reset_backoff__(..)) then double backoff.
>>>>>>
>>>>>> I see a couple of potential ways to fix that:
>>>>>> 1. reconnect_graceful_reconnect() could bump fsm->backoff_free_tries to
>>>>>> allow a single backoff free reconnect.
>>>>>> 2. reconnect_graceful_reconnect() could reset fsm->backoff to 0 if it
>>>>>> was active recently (instead of calling reconnect_reset_backoff__()).
>>>>>>
>>>>>> However, this is all under the assumption that we want to add support
>>>>>> for two types of reconnect "semantics":
>>>>>>
>>>>>> a. forced, e.g., when the client detects inconsistencies in the data
>>>>>> received from the server (the server is in "bad shape"), in which case
>>>>>> we should never reset the backoff.
>>>>>>
>>>>>> b. graceful, e.g., when the client reconnects because it needs a
>>>>>> leader-only connection and a leadership transfer happened on the
>>>>>> server side (the server is likely in "good shape" just not a leader
>>>>>> anymore).
>>>>>>
>>>>>> An alternative to all this is to change reconnect_disconnected() to
>>>>>> reset backoff for connections in states S_ACTIVE | S_IDLE | S_RECONNECT
>>>>>> (used to be just active and idle).
>>>>>>
>>>>>> I guess we could treat this as a bug fix and continue discussion for a
>>>>>> separate follow up patch to add the "graceful vs force" semantics if
>>>>>> they turn out to make sense for the future.
>>>>>>
>>>>>> In essence (excluding potential python and test changes) the patch would
>>>>>> become:
>>>>>>
>>>>>> diff --git a/lib/reconnect.c b/lib/reconnect.c
>>>>>> index a929ddfd2d..3a6d93f9d1 100644
>>>>>> --- a/lib/reconnect.c
>>>>>> +++ b/lib/reconnect.c
>>>>>> @@ -385,7 +385,7 @@ reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
>>>>>>          if (fsm->backoff_free_tries > 1) {
>>>>>>              fsm->backoff_free_tries--;
>>>>>>              fsm->backoff = 0;
>>>>>> -        } else if (fsm->state & (S_ACTIVE | S_IDLE)
>>>>>> +        } else if (fsm->state & (S_ACTIVE | S_IDLE | S_RECONNECT)
>>>>>>                     && (fsm->last_activity - fsm->last_connected >= fsm->backoff
>>>>>>                         || fsm->passive)) {
>>>>>>              fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
>>>>>> ---
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> The problem I see with adding S_RECONNECT to this condition is that
>>>>> 'last_activity' and 'last_connected' don't belong to the current
>>>>> connection in case we're forcefully interrupting a connection in the
>>>>> S_CONNECTING state.  So, we will re-set backoff based on outdated
>>>>> information.  Also, I think, in this case, 'last_connected' might
>>>>> be larger than 'last_activity' and we will have a negative value here.
>>>>
>>>> You're right, I missed this.
>>>>
>>>>> All values are signed, so it should not be an issue, but it's not
>>>>> a clean solution.
>>>>
>>>> It's not, I agree.
>>>>
>>>>>
>>>>> Bumping the number of free tries seems like a better solution to me,
>>>>> because:
>>>>>
>>>>> 1. I'm not sure that we need 2 types of reconnect.  I mean, backoff
>>>>>    is intended to avoid overloading the already overloaded server with
>>>>>    connection attempts.  In case of forceful re-connection the server
>>>>>    is not overloaded, so we should not increase a backoff.
>>>>>    IMO, backoff should only be involved in cases where we have
>>>>>    problems on the server side related to load, not the leader-only
>>>>>    stuff or even database inconsistency problems.
>>>>>
>>>>> 2. force-reconnect should not consume free tries for exactly the same
>>>>>    reasons as in point 1.  Free tries are to avoid backoff, but
>>>>>    backoff exists only to avoid storming the remote server with
>>>>>    connections while it's overloaded.
>>>>>    No overload -> No need for backoff -> No need to consume free tries.
>>>>>
>>>>> 3. With bumping of free tries we're avoiding logical issues around
>>>>>    using 'last_activity' and 'last_connected' from the old connection.
>>>>>
>>>>> 4. Faster reconnection, since backoff will be zero.  And that is
>>>>>    fine, because as far as we know, the server is not overloaded,
>>>>>    it's just not suitable for our needs for some reason.
>>>>>    If it is overloaded, we will backoff after the first failed
>>>>>    connection attempt.
>>>>
>>>> This is even better (instead of min_backoff, I mean).
>>>>
>>>>>
>>>>> One potential problem I see with this solution is that if there
>>>>> are no servers on a list that are suitable for ovsdb-cs, we will
>>>>> have a log full of annoying reconnect logs.  Backoff kind of
>>>>> rate-limits ovsdb-cs right now and we have no internal rate-limit
>>>>> inside ovsdb-cs.  This looks more like an issue of ovsdb-cs and
>>>>> not the issue of reconnect module itself.  But we need to have a
>>>>> solution for this.
>>>>>
>>>>> And actually, the problem here is that from the jsonrpc point of
>>>>> view the connection is perfectly fine and functional, but ovsdb-cs
>>>>> implements high-level logic on top of it and decides that the
>>>>> connection is not good at a much later stage based on the actual
>>>>> received data.  So, we need to somehow propagate this information
>>>>> from ovsdb-cs down to reconnect module or allow ovsdb-cs to control
>>>>> the state machine, otherwise we will need two separate backoffs:
>>>>
>>>> That's what I was trying to implement with the "graceful reconnect":
>>>> allow ovsdb-cs to decide how to reconnect.  But there are corner cases
>>>> like the ones pointed out during the review of this patch.
>>>>
>>>>> one inside the reconnect for usual connection problems and one
>>>>> in ovsdb-cs for high-level data inconsistencies and leader changes.
>>>>>
>>>>> Thinking this way led me to a different solution.  We could expose
>>>>> something like jsonrpc_session_set_backoff_free_tries() and allow
>>>>> ovsdb-cs to make a decision.  E.g.:
>>>>>
>>>>> diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
>>>>> index f13065c6c..900597b96 100644
>>>>> --- a/lib/ovsdb-cs.c
>>>>> +++ b/lib/ovsdb-cs.c
>>>>> @@ -729,6 +729,17 @@ void
>>>>>  ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
>>>>>  {
>>>>>      if (cs->session) {
>>>>> +        if (cs->state == CS_S_MONITORING) {
>>>>> +            /* The ovsdb-cs was in MONITORING state, so we either had data
>>>>> +             * inconsistency on this server, or it stopped being the cluster
>>>>> +             * leader, or the user requested to re-connect.  Avoiding backoff
>>>>> +             * in these cases, as we need to re-connect as soon as possible.
>>>>> +             * Connections that are not in MONITORING state should have their
>>>>> +             * backoff to avoid constant flood of re-connection attempts in
>>>>> +             * case there is no suitable database server. */
>>>>> +            jsonrpc_session_set_backoff_free_tries(
>>>>> +                cs->session, jsonrpc_session_get_n_remotes(cs->session));
>>>>> +        }
>>>>>          jsonrpc_session_force_reconnect(cs->session);
>>>>>      }
>>>>>  }
>>>>> ---
>>>>>
>>>>> This way, if the server loses leadership or an inconsistency is detected,
>>>>> the client will have 3 free attempts to find a new suitable server.
>>>>> After that it will start to backoff as it does now.  No changes in
>>>>> reconnect module required.
>>>>>
>>>>> Thoughts?
>>>>
>>>> This works for me.  I just have a question regarding the new API: should
>>>> we allow jsonrpc users to set the free tries to any value or shall we
>>>> make it more strict, e.g., jsonrpc_session_reset_backoff_free_tries(),
>>>> which would reset the number of free tries to 'n_remotes'?
>>
>> And it should probably set the number of free tries to n_remotes + 1,
>> because it will count the current disconnection as an attempt.
>>
> 
> Makes sense.
> 
>>>
>>> jsonrpc_session_reset_backoff_free_tries() seems better, because ovsdb-cs
>>> doesn't actually set the number of free tries for jsonrpc from the start.
>>> Maybe we can name it just jsonrpc_session_reset_backoff()?  I'm not
>>> sure, but I just don't like this very long name.
>>>
>>>>
>>>> Will you be sending a patch or shall I add your "Suggested-by"?
>>>
>>> I'm OK with "Suggested-by".
>>> It would also be great to have some kind of test for this functionality.
>>>
> 
> Let me try to come up with something.  I'll post a v2 these days.
> 

v2 available at:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210721105614.25498-1-dceara@redhat.com/

Thanks,
Dumitru
diff mbox series

Patch

diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index 926cbcb86e..66b7f9ef4c 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1267,6 +1267,13 @@  jsonrpc_session_force_reconnect(struct jsonrpc_session *s)
     reconnect_force_reconnect(s->reconnect, time_msec());
 }
 
+/* Makes 's' gracefully drop its connection (if any) and reconnect. */
+void
+jsonrpc_session_graceful_reconnect(struct jsonrpc_session *s)
+{
+    reconnect_graceful_reconnect(s->reconnect, time_msec());
+}
+
 /* Sets 'max_backoff' as the maximum time, in milliseconds, to wait after a
  * connection attempt fails before attempting to connect again. */
 void
diff --git a/lib/jsonrpc.h b/lib/jsonrpc.h
index d75d66b863..3a34424e22 100644
--- a/lib/jsonrpc.h
+++ b/lib/jsonrpc.h
@@ -136,6 +136,7 @@  void jsonrpc_session_get_reconnect_stats(const struct jsonrpc_session *,
 
 void jsonrpc_session_enable_reconnect(struct jsonrpc_session *);
 void jsonrpc_session_force_reconnect(struct jsonrpc_session *);
+void jsonrpc_session_graceful_reconnect(struct jsonrpc_session *);
 
 void jsonrpc_session_set_max_backoff(struct jsonrpc_session *,
                                      int max_backoff);
diff --git a/lib/ovsdb-cs.c b/lib/ovsdb-cs.c
index 911b71dd4f..6f16d29f6d 100644
--- a/lib/ovsdb-cs.c
+++ b/lib/ovsdb-cs.c
@@ -230,8 +230,10 @@  static void ovsdb_cs_transition_at(struct ovsdb_cs *, enum ovsdb_cs_state,
 #define ovsdb_cs_transition(CS, STATE) \
     ovsdb_cs_transition_at(CS, STATE, OVS_SOURCE_LOCATOR)
 
-static void ovsdb_cs_retry_at(struct ovsdb_cs *, const char *where);
-#define ovsdb_cs_retry(CS) ovsdb_cs_retry_at(CS, OVS_SOURCE_LOCATOR)
+static void ovsdb_cs_retry_at(struct ovsdb_cs *, bool graceful,
+                              const char *where);
+#define ovsdb_cs_retry(CS, GRACEFUL) \
+    ovsdb_cs_retry_at((CS), (GRACEFUL), OVS_SOURCE_LOCATOR)
 
 static struct vlog_rate_limit syntax_rl = VLOG_RATE_LIMIT_INIT(1, 5);
 
@@ -400,9 +402,21 @@  ovsdb_cs_send_request(struct ovsdb_cs *cs, struct jsonrpc_msg *request)
 }
 
 static void
-ovsdb_cs_retry_at(struct ovsdb_cs *cs, const char *where)
+ovsdb_cs_reconnect(struct ovsdb_cs *cs, bool graceful)
 {
-    ovsdb_cs_force_reconnect(cs);
+    if (cs->session) {
+        if (graceful) {
+            jsonrpc_session_graceful_reconnect(cs->session);
+        } else {
+            jsonrpc_session_force_reconnect(cs->session);
+        }
+    }
+}
+
+static void
+ovsdb_cs_retry_at(struct ovsdb_cs *cs, bool graceful, const char *where)
+{
+    ovsdb_cs_reconnect(cs, graceful);
     ovsdb_cs_transition_at(cs, CS_S_RETRY, where);
 }
 
@@ -438,7 +452,7 @@  ovsdb_cs_process_response(struct ovsdb_cs *cs, struct jsonrpc_msg *msg)
                      ovsdb_cs_state_to_string(cs->state),
                      s);
         free(s);
-        ovsdb_cs_retry(cs);
+        ovsdb_cs_retry(cs, false);
         return;
     }
 
@@ -711,9 +725,7 @@  ovsdb_cs_enable_reconnect(struct ovsdb_cs *cs)
 void
 ovsdb_cs_force_reconnect(struct ovsdb_cs *cs)
 {
-    if (cs->session) {
-        jsonrpc_session_force_reconnect(cs->session);
-    }
+    ovsdb_cs_reconnect(cs, false);
 }
 
 /* Drops 'cs''s current connection and the cached session.  This is useful if
@@ -722,7 +734,7 @@  void
 ovsdb_cs_flag_inconsistency(struct ovsdb_cs *cs)
 {
     cs->data.last_id = UUID_ZERO;
-    ovsdb_cs_retry(cs);
+    ovsdb_cs_retry(cs, false);
 }
 
 /* Returns true if 'cs' is currently connected or will eventually try to
@@ -1951,7 +1963,7 @@  ovsdb_cs_check_server_db(struct ovsdb_cs *cs)
 {
     bool ok = ovsdb_cs_check_server_db__(cs);
     if (!ok) {
-        ovsdb_cs_retry(cs);
+        ovsdb_cs_retry(cs, true);
     }
     return ok;
 }
diff --git a/lib/reconnect.c b/lib/reconnect.c
index a929ddfd2d..633e77bdb2 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -77,6 +77,7 @@  static void reconnect_transition__(struct reconnect *, long long int now,
                                    enum state state);
 static long long int reconnect_deadline__(const struct reconnect *);
 static bool reconnect_may_retry(struct reconnect *);
+static bool reconnect_reset_backoff__(struct reconnect *);
 
 static const char *
 reconnect_state_name__(enum state state)
@@ -330,6 +331,23 @@  reconnect_force_reconnect(struct reconnect *fsm, long long int now)
     }
 }
 
+/* If 'fsm' is enabled and currently connected (or attempting to connect),
+ * forces reconnect_run() for 'fsm' to return RECONNECT_DISCONNECT the next
+ * time it is called, which should cause the client to drop the connection (or
+ * attempt), back off, and then reconnect.
+ *
+ * Unlike reconnect_force_reconnect(), this also resets the backoff for
+ * connections that were active long enough.
+ */
+void
+reconnect_graceful_reconnect(struct reconnect *fsm, long long int now)
+{
+    if (fsm->state & (S_CONNECTING | S_ACTIVE | S_IDLE)) {
+        reconnect_reset_backoff__(fsm);
+        reconnect_transition__(fsm, now, S_RECONNECT);
+    }
+}
+
 /* Tell 'fsm' that the connection dropped or that a connection attempt failed.
  * 'error' specifies the reason: a positive value represents an errno value,
  * EOF indicates that the connection was closed by the peer (e.g. read()
@@ -385,11 +403,7 @@  reconnect_disconnected(struct reconnect *fsm, long long int now, int error)
         if (fsm->backoff_free_tries > 1) {
             fsm->backoff_free_tries--;
             fsm->backoff = 0;
-        } else if (fsm->state & (S_ACTIVE | S_IDLE)
-                   && (fsm->last_activity - fsm->last_connected >= fsm->backoff
-                       || fsm->passive)) {
-            fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
-        } else {
+        } else if (!reconnect_reset_backoff__(fsm)) {
             if (fsm->backoff < fsm->min_backoff) {
                 fsm->backoff = fsm->min_backoff;
             } else if (fsm->backoff < fsm->max_backoff / 2) {
@@ -745,3 +759,15 @@  reconnect_may_retry(struct reconnect *fsm)
     }
     return may_retry;
 }
+
+static bool
+reconnect_reset_backoff__(struct reconnect *fsm)
+{
+    if (fsm->state & (S_ACTIVE | S_IDLE)
+        && (fsm->last_activity - fsm->last_connected >= fsm->backoff
+            || fsm->passive)) {
+        fsm->backoff = fsm->passive ? 0 : fsm->min_backoff;
+        return true;
+    }
+    return false;
+}
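
To make the intended call sequence concrete, the sketch below (illustrative
only, not part of the patch; the helper name is made up) shows how a caller
such as jsonrpc_session_run() would drive the FSM through the graceful
reconnect path discussed in the review above:

#include "reconnect.h"
#include "timeval.h"

static void
session_graceful_reconnect(struct reconnect *fsm)
{
    /* Reset the backoff (for sessions that were active for long enough) and
     * move the FSM to RECONNECT. */
    reconnect_graceful_reconnect(fsm, time_msec());

    /* The next poll loop iteration sees RECONNECT_DISCONNECT... */
    if (reconnect_run(fsm, time_msec()) == RECONNECT_DISCONNECT) {
        /* ...so the caller drops its connection and reports it; from here
         * the FSM backs off, starting from the freshly reset backoff, and
         * then asks for a new connection attempt with RECONNECT_CONNECT. */
        reconnect_disconnected(fsm, time_msec(), 0);
    }
}
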
diff --git a/lib/reconnect.h b/lib/reconnect.h
index 40cc569c42..25bc627d66 100644
--- a/lib/reconnect.h
+++ b/lib/reconnect.h
@@ -67,6 +67,7 @@  void reconnect_enable(struct reconnect *, long long int now);
 void reconnect_disable(struct reconnect *, long long int now);
 
 void reconnect_force_reconnect(struct reconnect *, long long int now);
+void reconnect_graceful_reconnect(struct reconnect *, long long int now);
 void reconnect_skip_backoff(struct reconnect *);
 
 bool reconnect_is_connected(const struct reconnect *);
diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
index 889cf3431b..49bb964217 100644
--- a/python/ovs/db/idl.py
+++ b/python/ovs/db/idl.py
@@ -276,7 +276,7 @@  class Idl(object):
                                         tables=self.server_tables)
                     self.change_seqno = initial_change_seqno
                     if not self.__check_server_db():
-                        self.force_reconnect()
+                        self.graceful_reconnect()
                         break
                 else:
                     self.__parse_update(msg.params[1], OVSDB_UPDATE)
@@ -442,6 +442,12 @@  class Idl(object):
         In the meantime, the contents of the IDL will not change."""
         self._session.force_reconnect()
 
+    def graceful_reconnect(self):
+        """Makes the IDL to gracefully drop its connection to the database and
+        reconnect.  In the meantime, the contents of the IDL will not
+        change."""
+        self._session.graceful_reconnect()
+
     def session_name(self):
         return self._session.get_name()
 
diff --git a/python/ovs/jsonrpc.py b/python/ovs/jsonrpc.py
index bf32f8c87c..353a07fc49 100644
--- a/python/ovs/jsonrpc.py
+++ b/python/ovs/jsonrpc.py
@@ -612,5 +612,8 @@  class Session(object):
     def force_reconnect(self):
         self.reconnect.force_reconnect(ovs.timeval.msec())
 
+    def graceful_reconnect(self):
+        self.reconnect.graceful_reconnect(ovs.timeval.msec())
+
     def get_num_of_remotes(self):
         return len(self.remotes)
diff --git a/python/ovs/reconnect.py b/python/ovs/reconnect.py
index c4c6c87e9f..0f4a8e18da 100644
--- a/python/ovs/reconnect.py
+++ b/python/ovs/reconnect.py
@@ -325,6 +325,20 @@  class Reconnect(object):
                           Reconnect.Idle):
             self._transition(now, Reconnect.Reconnect)
 
+    def graceful_reconnect(self, now):
+        """If this FSM is enabled and currently connected (or attempting to
+        connect), forces self.run() to return ovs.reconnect.DISCONNECT the next
+        time it is called, which should cause the client to drop the connection
+        (or attempt), back off, and then reconnect.
+
+        Unlike self.force_reconnect(), this also resets the backoff for
+        long-lived sessions."""
+        if self.state in (Reconnect.ConnectInProgress,
+                          Reconnect.Active,
+                          Reconnect.Idle):
+            self._reset_backoff()
+            self._transition(now, Reconnect.Reconnect)
+
     def disconnected(self, now, error):
         """Tell this FSM that the connection dropped or that a connection
         attempt failed.  'error' specifies the reason: a positive value
@@ -377,14 +391,7 @@  class Reconnect(object):
             if self.backoff_free_tries > 1:
                 self.backoff_free_tries -= 1
                 self.backoff = 0
-            elif (self.state in (Reconnect.Active, Reconnect.Idle) and
-                (self.last_activity - self.last_connected >= self.backoff or
-                 self.passive)):
-                if self.passive:
-                    self.backoff = 0
-                else:
-                    self.backoff = self.min_backoff
-            else:
+            elif not self._reset_backoff():
                 if self.backoff < self.min_backoff:
                     self.backoff = self.min_backoff
                 elif self.backoff < self.max_backoff / 2:
@@ -623,3 +630,14 @@  class Reconnect(object):
             return True
         else:
             return False
+
+    def _reset_backoff(self):
+        if (self.state in (Reconnect.Active, Reconnect.Idle) and
+            (self.last_activity - self.last_connected >= self.backoff or
+             self.passive)):
+            if self.passive:
+                self.backoff = 0
+            else:
+                self.backoff = self.min_backoff
+            return True
+        return False
diff --git a/tests/reconnect.at b/tests/reconnect.at
index 0f74709f5a..884483df3d 100644
--- a/tests/reconnect.at
+++ b/tests/reconnect.at
@@ -1303,3 +1303,347 @@  run
 listening
   in LISTENING for 0 ms (0 ms backoff)
 ])
+
+######################################################################
+RECONNECT_CHECK([long connection doesn't reset backoff (force-reconnect)],
+  [enable
+
+# First connection attempt fails after 1000 ms.
+run
+connecting
+run
+timeout
+run
+connect-failed
+
+# Back off for 1000 ms.
+timeout
+run
+
+# Second connection attempt fails after 1000 ms.
+connecting
+timeout
+run
+connect-failed
+
+# Back off for 2000 ms.
+timeout
+run
+
+# Third connection attempt succeeds after 500 ms.
+connecting
+advance 500
+run
+connected
+
+# Connection receives 3 chunks of data spaced 2000 ms apart.
+advance 2000
+run
+activity
+advance 2000
+run
+activity
+advance 2000
+run
+activity
+
+# Forcefully reconnect.
+force-reconnect
+run
+connecting
+
+# Back off for 2000 ms.
+timeout
+
+# Connection succeeds.
+connected
+],
+  [### t=1000 ###
+enable
+  in BACKOFF for 0 ms (0 ms backoff)
+
+# First connection attempt fails after 1000 ms.
+run
+  should connect
+connecting
+  in CONNECTING for 0 ms (0 ms backoff)
+run
+timeout
+  advance 1000 ms
+
+### t=2000 ###
+  in CONNECTING for 1000 ms (0 ms backoff)
+run
+  should disconnect
+connect-failed
+  in BACKOFF for 0 ms (1000 ms backoff)
+  0 successful connections out of 1 attempts, seqno 0
+
+# Back off for 1000 ms.
+timeout
+  advance 1000 ms
+
+### t=3000 ###
+  in BACKOFF for 1000 ms (1000 ms backoff)
+run
+  should connect
+
+# Second connection attempt fails after 1000 ms.
+connecting
+  in CONNECTING for 0 ms (1000 ms backoff)
+timeout
+  advance 1000 ms
+
+### t=4000 ###
+  in CONNECTING for 1000 ms (1000 ms backoff)
+run
+  should disconnect
+connect-failed
+  in BACKOFF for 0 ms (2000 ms backoff)
+  0 successful connections out of 2 attempts, seqno 0
+
+# Back off for 2000 ms.
+timeout
+  advance 2000 ms
+
+### t=6000 ###
+  in BACKOFF for 2000 ms (2000 ms backoff)
+run
+  should connect
+
+# Third connection attempt succeeds after 500 ms.
+connecting
+  in CONNECTING for 0 ms (2000 ms backoff)
+advance 500
+
+### t=6500 ###
+  in CONNECTING for 500 ms (2000 ms backoff)
+run
+connected
+  in ACTIVE for 0 ms (2000 ms backoff)
+  created 1000, last activity 1000, last connected 6500
+  1 successful connections out of 3 attempts, seqno 1
+  connected
+  last connected 0 ms ago, connected 0 ms total
+
+# Connection receives 3 chunks of data spaced 2000 ms apart.
+advance 2000
+
+### t=8500 ###
+  in ACTIVE for 2000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 8500, last connected 6500
+advance 2000
+
+### t=10500 ###
+  in ACTIVE for 4000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 10500, last connected 6500
+advance 2000
+
+### t=12500 ###
+  in ACTIVE for 6000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 12500, last connected 6500
+
+# Forcefully reconnect.
+force-reconnect
+  in RECONNECT for 0 ms (2000 ms backoff)
+  1 successful connections out of 3 attempts, seqno 2
+  disconnected
+run
+  should disconnect
+connecting
+  in CONNECTING for 0 ms (2000 ms backoff)
+
+# Back off for 2000 ms.
+timeout
+  advance 2000 ms
+
+### t=14500 ###
+  in CONNECTING for 2000 ms (2000 ms backoff)
+  last connected 8000 ms ago, connected 6000 ms total
+
+# Connection succeeds.
+connected
+  in ACTIVE for 0 ms (2000 ms backoff)
+  created 1000, last activity 12500, last connected 14500
+  2 successful connections out of 4 attempts, seqno 3
+  connected
+  last connected 0 ms ago, connected 6000 ms total
+])
+
+######################################################################
+RECONNECT_CHECK([long connection resets backoff (graceful-reconnect)],
+  [enable
+
+# First connection attempt fails after 1000 ms.
+run
+connecting
+run
+timeout
+run
+connect-failed
+
+# Back off for 1000 ms.
+timeout
+run
+
+# Second connection attempt fails after 1000 ms.
+connecting
+timeout
+run
+connect-failed
+
+# Back off for 2000 ms.
+timeout
+run
+
+# Third connection attempt succeeds after 500 ms.
+connecting
+advance 500
+run
+connected
+
+# Connection receives 3 chunks of data spaced 2000 ms apart.
+advance 2000
+run
+activity
+advance 2000
+run
+activity
+advance 2000
+run
+activity
+
+# Gracefully reconnect.
+graceful-reconnect
+run
+connecting
+
+# Back off for 1000 ms.
+timeout
+
+# Connection succeeds.
+connected
+],
+  [### t=1000 ###
+enable
+  in BACKOFF for 0 ms (0 ms backoff)
+
+# First connection attempt fails after 1000 ms.
+run
+  should connect
+connecting
+  in CONNECTING for 0 ms (0 ms backoff)
+run
+timeout
+  advance 1000 ms
+
+### t=2000 ###
+  in CONNECTING for 1000 ms (0 ms backoff)
+run
+  should disconnect
+connect-failed
+  in BACKOFF for 0 ms (1000 ms backoff)
+  0 successful connections out of 1 attempts, seqno 0
+
+# Back off for 1000 ms.
+timeout
+  advance 1000 ms
+
+### t=3000 ###
+  in BACKOFF for 1000 ms (1000 ms backoff)
+run
+  should connect
+
+# Second connection attempt fails after 1000 ms.
+connecting
+  in CONNECTING for 0 ms (1000 ms backoff)
+timeout
+  advance 1000 ms
+
+### t=4000 ###
+  in CONNECTING for 1000 ms (1000 ms backoff)
+run
+  should disconnect
+connect-failed
+  in BACKOFF for 0 ms (2000 ms backoff)
+  0 successful connections out of 2 attempts, seqno 0
+
+# Back off for 2000 ms.
+timeout
+  advance 2000 ms
+
+### t=6000 ###
+  in BACKOFF for 2000 ms (2000 ms backoff)
+run
+  should connect
+
+# Third connection attempt succeeds after 500 ms.
+connecting
+  in CONNECTING for 0 ms (2000 ms backoff)
+advance 500
+
+### t=6500 ###
+  in CONNECTING for 500 ms (2000 ms backoff)
+run
+connected
+  in ACTIVE for 0 ms (2000 ms backoff)
+  created 1000, last activity 1000, last connected 6500
+  1 successful connections out of 3 attempts, seqno 1
+  connected
+  last connected 0 ms ago, connected 0 ms total
+
+# Connection receives 3 chunks of data spaced 2000 ms apart.
+advance 2000
+
+### t=8500 ###
+  in ACTIVE for 2000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 8500, last connected 6500
+advance 2000
+
+### t=10500 ###
+  in ACTIVE for 4000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 10500, last connected 6500
+advance 2000
+
+### t=12500 ###
+  in ACTIVE for 6000 ms (2000 ms backoff)
+run
+activity
+  created 1000, last activity 12500, last connected 6500
+
+# Gracefully reconnect.
+graceful-reconnect
+  in RECONNECT for 0 ms (1000 ms backoff)
+  1 successful connections out of 3 attempts, seqno 2
+  disconnected
+run
+  should disconnect
+connecting
+  in CONNECTING for 0 ms (1000 ms backoff)
+
+# Back off for 1000 ms.
+timeout
+  advance 1000 ms
+
+### t=13500 ###
+  in CONNECTING for 1000 ms (1000 ms backoff)
+  last connected 7000 ms ago, connected 6000 ms total
+
+# Connection succeeds.
+connected
+  in ACTIVE for 0 ms (1000 ms backoff)
+  created 1000, last activity 12500, last connected 13500
+  2 successful connections out of 4 attempts, seqno 3
+  connected
+  last connected 0 ms ago, connected 6000 ms total
+])
diff --git a/tests/test-reconnect.c b/tests/test-reconnect.c
index c84bb1cdbf..cabf228f84 100644
--- a/tests/test-reconnect.c
+++ b/tests/test-reconnect.c
@@ -108,6 +108,12 @@  do_force_reconnect(struct ovs_cmdl_context *ctx OVS_UNUSED)
     reconnect_force_reconnect(reconnect, now);
 }
 
+static void
+do_graceful_reconnect(struct ovs_cmdl_context *ctx OVS_UNUSED)
+{
+    reconnect_graceful_reconnect(reconnect, now);
+}
+
 static int
 error_from_string(const char *s)
 {
@@ -293,6 +299,7 @@  static const struct ovs_cmdl_command all_commands[] = {
     { "enable", NULL, 0, 0, do_enable, OVS_RO },
     { "disable", NULL, 0, 0, do_disable, OVS_RO },
     { "force-reconnect", NULL, 0, 0, do_force_reconnect, OVS_RO },
+    { "graceful-reconnect", NULL, 0, 0, do_graceful_reconnect, OVS_RO },
     { "disconnected", NULL, 0, 1, do_disconnected, OVS_RO },
     { "connecting", NULL, 0, 0, do_connecting, OVS_RO },
     { "connect-failed", NULL, 0, 1, do_connect_failed, OVS_RO },
diff --git a/tests/test-reconnect.py b/tests/test-reconnect.py
index cea48eb527..b397eac656 100644
--- a/tests/test-reconnect.py
+++ b/tests/test-reconnect.py
@@ -33,6 +33,10 @@  def do_force_reconnect(_):
     r.force_reconnect(now)
 
 
+def do_graceful_reconnect(_):
+    r.graceful_reconnect(now)
+
+
 def error_from_string(s):
     if not s:
         return 0
@@ -176,6 +180,7 @@  def main():
         "enable": do_enable,
         "disable": do_disable,
         "force-reconnect": do_force_reconnect,
+        "graceful-reconnect": do_graceful_reconnect,
         "disconnected": do_disconnected,
         "connecting": do_connecting,
         "connect-failed": do_connect_failed,