diff mbox series

[ovs-dev,v3,1/9] jsonrpc: Avoid disconnecting prematurely due to long poll intervals.

Message ID 20201218053144.2637583-2-blp@ovn.org
State Accepted
Headers show
Series Refactor OVSDB IDL into two layers | expand

Commit Message

Ben Pfaff Dec. 18, 2020, 5:31 a.m. UTC
Open vSwitch has a few different jsonrpc-based protocols that depend on
jsonrpc_session to make sure that the connection is up and working.
In turn, jsonrpc_session uses the "reconnect" state machine to send
probes if nothing is received.  This works fine in normal circumstances.
In unusual circumstances, though, it can happen that the program is
busy and doesn't even try to receive anything for a long time.  Then the
timer can time out without a good reason; if it had tried to receive
something, it would have.

There's a solution that the clients of jsonrpc_session could adopt.
Instead of first calling jsonrpc_session_run(), which is what calls into
"reconnect" to deal with timing out, and then calling into
jsonrpc_session_recv(), which is what tries to receive something, they
could use the opposite order.  That would make sure that the timeout
was always based on a recent attempt to receive something.  Great.

The actual code in OVS that uses jsonrpc_session, though, tends to use
the opposite order, and there are enough users and this is a subtle
enough issue that it could get flipped back around even if we fixed it
now.  So this commit takes a different approach.  Instead of fixing
this in the users of jsonrpc_session, we fix it in the users of
reconnect: make them tell when they've tried to receive something (or
disable this particular feature).

This commit fixes the problem that way.  It's kind of hard to reproduce
but I'm pretty sure that I've seen it a number of times in testing.

Signed-off-by: Ben Pfaff <blp@ovn.org>
---
 lib/jsonrpc.c          |  5 ++++-
 lib/reconnect.c        | 25 +++++++++++++++++++++++--
 lib/reconnect.h        |  1 +
 tests/test-reconnect.c |  1 +
 4 files changed, 29 insertions(+), 3 deletions(-)

Comments

0-day Robot Dec. 18, 2020, 6 a.m. UTC | #1
Bleep bloop.  Greetings Ben Pfaff, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 80 characters long (recommended limit is 79)
#118 FILE: lib/reconnect.c:568:
            long long int expiration = fsm->state_entered + fsm->probe_interval;

Lines checked: 151, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
Ilya Maximets Dec. 18, 2020, 9:11 p.m. UTC | #2
On 12/18/20 6:31 AM, Ben Pfaff wrote:
> Open vSwitch has a few different jsonrpc-based protocols that depend on
> jsonrpc_session to make sure that the connection is up and working.
> In turn, jsonrpc_session uses the "reconnect" state machine to send
> probes if nothing is received.  This works fine in normal circumstances.
> In unusual circumstances, though, it can happen that the program is
> busy and doesn't even try to receive anything for a long time.  Then the
> timer can time out without a good reason; if it had tried to receive
> something, it would have.
> 
> There's a solution that the clients of jsonrpc_session could adopt.
> Instead of first calling jsonrpc_session_run(), which is what calls into
> "reconnect" to deal with timing out, and then calling into
> jsonrpc_session_recv(), which is what tries to receive something, they
> could use the opposite order.  That would make sure that the timeout
> was always based on a recent attempt to receive something.  Great.
> 
> The actual code in OVS that uses jsonrpc_session, though, tends to use
> the opposite order, and there are enough users and this is a subtle
> enough issue that it could get flipped back around even if we fixed it
> now.  So this commit takes a different approach.  Instead of fixing
> this in the users of jsonrpc_session, we fix it in the users of
> reconnect: make them tell when they've tried to receive something (or
> disable this particular feature).
> 
> This commit fixes the problem that way.  It's kind of hard to reproduce
> but I'm pretty sure that I've seen it a number of times in testing.
> 
> Signed-off-by: Ben Pfaff <blp@ovn.org>
> ---
>  lib/jsonrpc.c          |  5 ++++-
>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
>  lib/reconnect.h        |  1 +
>  tests/test-reconnect.c |  1 +
>  4 files changed, 29 insertions(+), 3 deletions(-)

Thanks for the fix!

I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
or some other name since the 'try_receive' part makes me think that
we're actually asking it to receive something.

It might also be good to add a specific unit test to test-reconnect
that will check that timeout is not set.  This should not be hard, but,
I guess, this could be a separate patch.

For this patch with or without above changes:

Acked-by: Ilya Maximets <i.maximets@ovn.org>

It might make sense to backport this to 2.13 as OVN users might hit
this issue on a highly loaded cluster.

Best regards, Ilya Maximets.
Ben Pfaff Dec. 19, 2020, 2:11 a.m. UTC | #3
On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
> On 12/18/20 6:31 AM, Ben Pfaff wrote:
> > Open vSwitch has a few different jsonrpc-based protocols that depend on
> > jsonrpc_session to make sure that the connection is up and working.
> > In turn, jsonrpc_session uses the "reconnect" state machine to send
> > probes if nothing is received.  This works fine in normal circumstances.
> > In unusual circumstances, though, it can happen that the program is
> > busy and doesn't even try to receive anything for a long time.  Then the
> > timer can time out without a good reason; if it had tried to receive
> > something, it would have.
> > 
> > There's a solution that the clients of jsonrpc_session could adopt.
> > Instead of first calling jsonrpc_session_run(), which is what calls into
> > "reconnect" to deal with timing out, and then calling into
> > jsonrpc_session_recv(), which is what tries to receive something, they
> > could use the opposite order.  That would make sure that the timeout
> > was always based on a recent attempt to receive something.  Great.
> > 
> > The actual code in OVS that uses jsonrpc_session, though, tends to use
> > the opposite order, and there are enough users and this is a subtle
> > enough issue that it could get flipped back around even if we fixed it
> > now.  So this commit takes a different approach.  Instead of fixing
> > this in the users of jsonrpc_session, we fix it in the users of
> > reconnect: make them tell when they've tried to receive something (or
> > disable this particular feature).
> > 
> > This commit fixes the problem that way.  It's kind of hard to reproduce
> > but I'm pretty sure that I've seen it a number of times in testing.
> > 
> > Signed-off-by: Ben Pfaff <blp@ovn.org>
> > ---
> >  lib/jsonrpc.c          |  5 ++++-
> >  lib/reconnect.c        | 25 +++++++++++++++++++++++--
> >  lib/reconnect.h        |  1 +
> >  tests/test-reconnect.c |  1 +
> >  4 files changed, 29 insertions(+), 3 deletions(-)
> 
> Thanks for the fix!
> 
> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
> or some other name since the 'try_receive' part makes me think that
> we're actually asking it to receive something.

I am sorry that I forgot about that request from the OVN meeting.  I
made that change just now.

I decided to call it reconnect_receive_attempted(), to indicate that the
received attempt has already happened.

> It might also be good to add a specific unit test to test-reconnect
> that will check that timeout is not set.  This should not be hard, but,
> I guess, this could be a separate patch.

I wasn't sure how to test this.  Could you explain your idea?  I didn't
quite follow.

> For this patch with or without above changes:
> 
> Acked-by: Ilya Maximets <i.maximets@ovn.org>
.
> It might make sense to backport this to 2.13 as OVN users might hit
> this issue on a highly loaded cluster.

Thanks.  I applied this to master and branch-2.13.
Ilya Maximets Dec. 21, 2020, 11:58 a.m. UTC | #4
On 12/19/20 3:11 AM, Ben Pfaff wrote:
> On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
>> On 12/18/20 6:31 AM, Ben Pfaff wrote:
>>> Open vSwitch has a few different jsonrpc-based protocols that depend on
>>> jsonrpc_session to make sure that the connection is up and working.
>>> In turn, jsonrpc_session uses the "reconnect" state machine to send
>>> probes if nothing is received.  This works fine in normal circumstances.
>>> In unusual circumstances, though, it can happen that the program is
>>> busy and doesn't even try to receive anything for a long time.  Then the
>>> timer can time out without a good reason; if it had tried to receive
>>> something, it would have.
>>>
>>> There's a solution that the clients of jsonrpc_session could adopt.
>>> Instead of first calling jsonrpc_session_run(), which is what calls into
>>> "reconnect" to deal with timing out, and then calling into
>>> jsonrpc_session_recv(), which is what tries to receive something, they
>>> could use the opposite order.  That would make sure that the timeout
>>> was always based on a recent attempt to receive something.  Great.
>>>
>>> The actual code in OVS that uses jsonrpc_session, though, tends to use
>>> the opposite order, and there are enough users and this is a subtle
>>> enough issue that it could get flipped back around even if we fixed it
>>> now.  So this commit takes a different approach.  Instead of fixing
>>> this in the users of jsonrpc_session, we fix it in the users of
>>> reconnect: make them tell when they've tried to receive something (or
>>> disable this particular feature).
>>>
>>> This commit fixes the problem that way.  It's kind of hard to reproduce
>>> but I'm pretty sure that I've seen it a number of times in testing.
>>>
>>> Signed-off-by: Ben Pfaff <blp@ovn.org>
>>> ---
>>>  lib/jsonrpc.c          |  5 ++++-
>>>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
>>>  lib/reconnect.h        |  1 +
>>>  tests/test-reconnect.c |  1 +
>>>  4 files changed, 29 insertions(+), 3 deletions(-)
>>
>> Thanks for the fix!
>>
>> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
>> or some other name since the 'try_receive' part makes me think that
>> we're actually asking it to receive something.
> 
> I am sorry that I forgot about that request from the OVN meeting.  I
> made that change just now.
> 
> I decided to call it reconnect_receive_attempted(), to indicate that the
> received attempt has already happened.
> 
>> It might also be good to add a specific unit test to test-reconnect
>> that will check that timeout is not set.  This should not be hard, but,
>> I guess, this could be a separate patch.
> 
> I wasn't sure how to test this.  Could you explain your idea?  I didn't
> quite follow.

I mean pure reconnect test in tests/reconnect.at.
Idea is to add new command 'do_receive_attempted' to test-reconnect.c and
check that in pair with 'advance' we will have particular timeouts or
'no timeout' if there was no receive attempted.  Does that make sense?

BTW, do we need the same fix for the python 'reconnect' library?

> 
>> For this patch with or without above changes:
>>
>> Acked-by: Ilya Maximets <i.maximets@ovn.org>
> .
>> It might make sense to backport this to 2.13 as OVN users might hit
>> this issue on a highly loaded cluster.
> 
> Thanks.  I applied this to master and branch-2.13.

We should, probably, apply this to 2.14 too to avoid feature/fix gap
during upgrade.

Best regards, Ilya Maximets.
Ben Pfaff Dec. 22, 2020, 12:22 a.m. UTC | #5
On Mon, Dec 21, 2020 at 12:58:14PM +0100, Ilya Maximets wrote:
> On 12/19/20 3:11 AM, Ben Pfaff wrote:
> > On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
> >> On 12/18/20 6:31 AM, Ben Pfaff wrote:
> >>> Open vSwitch has a few different jsonrpc-based protocols that depend on
> >>> jsonrpc_session to make sure that the connection is up and working.
> >>> In turn, jsonrpc_session uses the "reconnect" state machine to send
> >>> probes if nothing is received.  This works fine in normal circumstances.
> >>> In unusual circumstances, though, it can happen that the program is
> >>> busy and doesn't even try to receive anything for a long time.  Then the
> >>> timer can time out without a good reason; if it had tried to receive
> >>> something, it would have.
> >>>
> >>> There's a solution that the clients of jsonrpc_session could adopt.
> >>> Instead of first calling jsonrpc_session_run(), which is what calls into
> >>> "reconnect" to deal with timing out, and then calling into
> >>> jsonrpc_session_recv(), which is what tries to receive something, they
> >>> could use the opposite order.  That would make sure that the timeout
> >>> was always based on a recent attempt to receive something.  Great.
> >>>
> >>> The actual code in OVS that uses jsonrpc_session, though, tends to use
> >>> the opposite order, and there are enough users and this is a subtle
> >>> enough issue that it could get flipped back around even if we fixed it
> >>> now.  So this commit takes a different approach.  Instead of fixing
> >>> this in the users of jsonrpc_session, we fix it in the users of
> >>> reconnect: make them tell when they've tried to receive something (or
> >>> disable this particular feature).
> >>>
> >>> This commit fixes the problem that way.  It's kind of hard to reproduce
> >>> but I'm pretty sure that I've seen it a number of times in testing.
> >>>
> >>> Signed-off-by: Ben Pfaff <blp@ovn.org>
> >>> ---
> >>>  lib/jsonrpc.c          |  5 ++++-
> >>>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
> >>>  lib/reconnect.h        |  1 +
> >>>  tests/test-reconnect.c |  1 +
> >>>  4 files changed, 29 insertions(+), 3 deletions(-)
> >>
> >> Thanks for the fix!
> >>
> >> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
> >> or some other name since the 'try_receive' part makes me think that
> >> we're actually asking it to receive something.
> > 
> > I am sorry that I forgot about that request from the OVN meeting.  I
> > made that change just now.
> > 
> > I decided to call it reconnect_receive_attempted(), to indicate that the
> > received attempt has already happened.
> > 
> >> It might also be good to add a specific unit test to test-reconnect
> >> that will check that timeout is not set.  This should not be hard, but,
> >> I guess, this could be a separate patch.
> > 
> > I wasn't sure how to test this.  Could you explain your idea?  I didn't
> > quite follow.
> 
> I mean pure reconnect test in tests/reconnect.at.
> Idea is to add new command 'do_receive_attempted' to test-reconnect.c and
> check that in pair with 'advance' we will have particular timeouts or
> 'no timeout' if there was no receive attempted.  Does that make sense?

OK.

I sent a patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2020-December/378965.html

> BTW, do we need the same fix for the python 'reconnect' library?

I added a Python implementation in the same patch.

> >> for this patch with or without above changes:
> >>
> >> Acked-by: Ilya Maximets <i.maximets@ovn.org>
> > .
> >> It might make sense to backport this to 2.13 as OVN users might hit
> >> this issue on a highly loaded cluster.
> > 
> > Thanks.  I applied this to master and branch-2.13.
> 
> We should, probably, apply this to 2.14 too to avoid feature/fix gap
> during upgrade.

I forgot about 2.14.  Thanks, I did that backport (forward-port?) too
now.
diff mbox series

Patch

diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
index 08aaff061ae5..46e5ea523685 100644
--- a/lib/jsonrpc.c
+++ b/lib/jsonrpc.c
@@ -1155,13 +1155,16 @@  jsonrpc_session_recv(struct jsonrpc_session *s)
 
         received_bytes = jsonrpc_get_received_bytes(s->rpc);
         jsonrpc_recv(s->rpc, &msg);
+
+        long long int now = time_msec();
+        reconnect_try_receive(s->reconnect, now);
         if (received_bytes != jsonrpc_get_received_bytes(s->rpc)) {
             /* Data was successfully received.
              *
              * Previously we only counted receiving a full message as activity,
              * but with large messages or a slow connection that policy could
              * time out the session mid-message. */
-            reconnect_activity(s->reconnect, time_msec());
+            reconnect_activity(s->reconnect, now);
         }
 
         if (msg) {
diff --git a/lib/reconnect.c b/lib/reconnect.c
index c89abab8894a..57d879ce43a1 100644
--- a/lib/reconnect.c
+++ b/lib/reconnect.c
@@ -61,6 +61,7 @@  struct reconnect {
     long long int last_activity;
     long long int last_connected;
     long long int last_disconnected;
+    long long int last_receive_attempt;
     unsigned int max_tries;
     unsigned int backoff_free_tries;
 
@@ -109,6 +110,7 @@  reconnect_create(long long int now)
     fsm->last_activity = now;
     fsm->last_connected = LLONG_MAX;
     fsm->last_disconnected = LLONG_MAX;
+    fsm->last_receive_attempt = now;
     fsm->max_tries = UINT_MAX;
     fsm->creation_time = now;
 
@@ -501,6 +503,19 @@  reconnect_activity(struct reconnect *fsm, long long int now)
     fsm->last_activity = now;
 }
 
+/* Tell 'fsm' that some attempt to receive data on the connection was made at
+ * 'now'.  The FSM only allows probe interval timer to expire when some attempt
+ * to receive data on the connection was received after the time when it should
+ * have expired.  This helps in the case where there's a long delay in the poll
+ * loop and then reconnect_run() executes before the code to try to receive
+ * anything from the remote runs.  (To disable this feature, just call
+ * reconnect_try_receive(fsm, LLONG_MAX).) */
+void
+reconnect_try_receive(struct reconnect *fsm, long long int now)
+{
+    fsm->last_receive_attempt = now;
+}
+
 static void
 reconnect_transition__(struct reconnect *fsm, long long int now,
                        enum state state)
@@ -541,13 +556,19 @@  reconnect_deadline__(const struct reconnect *fsm)
     case S_ACTIVE:
         if (fsm->probe_interval) {
             long long int base = MAX(fsm->last_activity, fsm->state_entered);
-            return base + fsm->probe_interval;
+            long long int expiration = base + fsm->probe_interval;
+            if (fsm->last_receive_attempt >= expiration) {
+                return expiration;
+            }
         }
         return LLONG_MAX;
 
     case S_IDLE:
         if (fsm->probe_interval) {
-            return fsm->state_entered + fsm->probe_interval;
+            long long int expiration = fsm->state_entered + fsm->probe_interval;
+            if (fsm->last_receive_attempt >= expiration) {
+                return expiration;
+            }
         }
         return LLONG_MAX;
 
diff --git a/lib/reconnect.h b/lib/reconnect.h
index 9f2d469e2ddd..de56400c5ef4 100644
--- a/lib/reconnect.h
+++ b/lib/reconnect.h
@@ -83,6 +83,7 @@  void reconnect_connected(struct reconnect *, long long int now);
 void reconnect_connect_failed(struct reconnect *, long long int now,
                               int error);
 void reconnect_activity(struct reconnect *, long long int now);
+void reconnect_try_receive(struct reconnect *, long long int now);
 
 enum reconnect_action {
     RECONNECT_CONNECT = 1,
diff --git a/tests/test-reconnect.c b/tests/test-reconnect.c
index 5a14e7fe58da..ebecb02c7ea9 100644
--- a/tests/test-reconnect.c
+++ b/tests/test-reconnect.c
@@ -48,6 +48,7 @@  test_reconnect_main(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
 
     now = 1000;
     reconnect = reconnect_create(now);
+    reconnect_try_receive(reconnect, LLONG_MAX);
     reconnect_set_name(reconnect, "remote");
     reconnect_get_stats(reconnect, now, &prev);
     printf("### t=%d ###\n", now);