From patchwork Thu Nov 19 00:55:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Pfaff X-Patchwork-Id: 1402668 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.137; helo=fraxinus.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ovn.org Received: from fraxinus.osuosl.org (smtp4.osuosl.org [140.211.166.137]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4Cc1Sb3tY3z9sTc for ; Thu, 19 Nov 2020 11:55:41 +1100 (AEDT) Received: from localhost (localhost [127.0.0.1]) by fraxinus.osuosl.org (Postfix) with ESMTP id D61ED86148; Thu, 19 Nov 2020 00:55:37 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from fraxinus.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id vpADHzGzscFx; Thu, 19 Nov 2020 00:55:36 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by fraxinus.osuosl.org (Postfix) with ESMTP id E66F385D17; Thu, 19 Nov 2020 00:55:36 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id B672CC0891; Thu, 19 Nov 2020 00:55:36 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from hemlock.osuosl.org (smtp2.osuosl.org [140.211.166.133]) by lists.linuxfoundation.org (Postfix) with ESMTP id 3B012C07FF for ; Thu, 19 Nov 2020 00:55:35 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by hemlock.osuosl.org (Postfix) with ESMTP id 2E6D58729A for ; Thu, 19 Nov 2020 00:55:35 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from hemlock.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id tH5jE+RD7A3h for ; Thu, 19 Nov 2020 00:55:34 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.7.6 Received: from relay4-d.mail.gandi.net (relay4-d.mail.gandi.net [217.70.183.196]) by hemlock.osuosl.org (Postfix) with ESMTPS id 8CF9687286 for ; Thu, 19 Nov 2020 00:55:33 +0000 (UTC) X-Originating-IP: 75.54.222.30 Received: from sigfpe.attlocal.net (75-54-222-30.lightspeed.rdcyca.sbcglobal.net [75.54.222.30]) (Authenticated sender: blp@ovn.org) by relay4-d.mail.gandi.net (Postfix) with ESMTPSA id 17511E0004; Thu, 19 Nov 2020 00:55:28 +0000 (UTC) From: Ben Pfaff To: dev@openvswitch.org Date: Wed, 18 Nov 2020 16:55:22 -0800 Message-Id: <20201119005522.526362-1-blp@ovn.org> X-Mailer: git-send-email 2.26.2 MIME-Version: 1.0 Cc: Ben Pfaff Subject: [ovs-dev] [PATCH] jsonrpc: Avoid disconnecting prematurely due to long poll intervals. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" Open vSwitch has a few different jsonrpc-based protocols that depend on jsonrpc_session to make sure that the connection is up and working. In turn, jsonrpc_session uses the "reconnect" state machine to send probes if nothing is received. This works fine in normal circumstances. In unusual circumstances, though, it can happen that the program is busy and doesn't even try to receive anything for a long time. Then the timer can time out without a good reason; if it had tried to receive something, it would have. There's a solution that the clients of jsonrpc_session could adopt. Instead of first calling jsonrpc_session_run(), which is what calls into "reconnect" to deal with timing out, and then calling into jsonrpc_session_recv(), which is what tries to receive something, they could use the opposite order. That would make sure that the timeout was always based on a recent attempt to receive something. Great. The actual code in OVS that uses jsonrpc_session, though, tends to use the opposite order, and there are enough users and this is a subtle enough issue that it could get flipped back around even if we fixed it now. So this commit takes a different approach. Instead of fixing this in the users of jsonrpc_session, we fix it in the users of reconnect: make them tell when they've tried to receive something (or disable this particular feature). This commit fixes the problem that way. It's kind of hard to reproduce but I'm pretty sure that I've seen it a number of times in testing. Signed-off-by: Ben Pfaff --- lib/jsonrpc.c | 5 ++++- lib/reconnect.c | 25 +++++++++++++++++++++++-- lib/reconnect.h | 1 + tests/test-reconnect.c | 1 + 4 files changed, 29 insertions(+), 3 deletions(-) diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c index ecbc939fe7f9..b3de8c3a8b93 100644 --- a/lib/jsonrpc.c +++ b/lib/jsonrpc.c @@ -1112,13 +1112,16 @@ jsonrpc_session_recv(struct jsonrpc_session *s) received_bytes = jsonrpc_get_received_bytes(s->rpc); jsonrpc_recv(s->rpc, &msg); + + long long int now = time_msec(); + reconnect_try_receive(s->reconnect, now); if (received_bytes != jsonrpc_get_received_bytes(s->rpc)) { /* Data was successfully received. * * Previously we only counted receiving a full message as activity, * but with large messages or a slow connection that policy could * time out the session mid-message. */ - reconnect_activity(s->reconnect, time_msec()); + reconnect_activity(s->reconnect, now); } if (msg) { diff --git a/lib/reconnect.c b/lib/reconnect.c index c89abab8894a..57d879ce43a1 100644 --- a/lib/reconnect.c +++ b/lib/reconnect.c @@ -61,6 +61,7 @@ struct reconnect { long long int last_activity; long long int last_connected; long long int last_disconnected; + long long int last_receive_attempt; unsigned int max_tries; unsigned int backoff_free_tries; @@ -109,6 +110,7 @@ reconnect_create(long long int now) fsm->last_activity = now; fsm->last_connected = LLONG_MAX; fsm->last_disconnected = LLONG_MAX; + fsm->last_receive_attempt = now; fsm->max_tries = UINT_MAX; fsm->creation_time = now; @@ -501,6 +503,19 @@ reconnect_activity(struct reconnect *fsm, long long int now) fsm->last_activity = now; } +/* Tell 'fsm' that some attempt to receive data on the connection was made at + * 'now'. The FSM only allows probe interval timer to expire when some attempt + * to receive data on the connection was received after the time when it should + * have expired. This helps in the case where there's a long delay in the poll + * loop and then reconnect_run() executes before the code to try to receive + * anything from the remote runs. (To disable this feature, just call + * reconnect_try_receive(fsm, LLONG_MAX).) */ +void +reconnect_try_receive(struct reconnect *fsm, long long int now) +{ + fsm->last_receive_attempt = now; +} + static void reconnect_transition__(struct reconnect *fsm, long long int now, enum state state) @@ -541,13 +556,19 @@ reconnect_deadline__(const struct reconnect *fsm) case S_ACTIVE: if (fsm->probe_interval) { long long int base = MAX(fsm->last_activity, fsm->state_entered); - return base + fsm->probe_interval; + long long int expiration = base + fsm->probe_interval; + if (fsm->last_receive_attempt >= expiration) { + return expiration; + } } return LLONG_MAX; case S_IDLE: if (fsm->probe_interval) { - return fsm->state_entered + fsm->probe_interval; + long long int expiration = fsm->state_entered + fsm->probe_interval; + if (fsm->last_receive_attempt >= expiration) { + return expiration; + } } return LLONG_MAX; diff --git a/lib/reconnect.h b/lib/reconnect.h index 9f2d469e2ddd..de56400c5ef4 100644 --- a/lib/reconnect.h +++ b/lib/reconnect.h @@ -83,6 +83,7 @@ void reconnect_connected(struct reconnect *, long long int now); void reconnect_connect_failed(struct reconnect *, long long int now, int error); void reconnect_activity(struct reconnect *, long long int now); +void reconnect_try_receive(struct reconnect *, long long int now); enum reconnect_action { RECONNECT_CONNECT = 1, diff --git a/tests/test-reconnect.c b/tests/test-reconnect.c index 5a14e7fe58da..ebecb02c7ea9 100644 --- a/tests/test-reconnect.c +++ b/tests/test-reconnect.c @@ -48,6 +48,7 @@ test_reconnect_main(int argc OVS_UNUSED, char *argv[] OVS_UNUSED) now = 1000; reconnect = reconnect_create(now); + reconnect_try_receive(reconnect, LLONG_MAX); reconnect_set_name(reconnect, "remote"); reconnect_get_stats(reconnect, now, &prev); printf("### t=%d ###\n", now);