From patchwork Fri Apr 12 23:26:28 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Zhou X-Patchwork-Id: 1084993 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=openvswitch.org (client-ip=140.211.169.12; helo=mail.linuxfoundation.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b="X+ToSiQZ"; dkim-atps=neutral Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 44gvHx4JDdz9s6w for ; Sat, 13 Apr 2019 09:29:49 +1000 (AEST) Received: from mail.linux-foundation.org (localhost [127.0.0.1]) by mail.linuxfoundation.org (Postfix) with ESMTP id 8AB3F18F2; Fri, 12 Apr 2019 23:26:34 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@mail.linuxfoundation.org Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35]) by mail.linuxfoundation.org (Postfix) with ESMTPS id 2D45C1882 for ; Fri, 12 Apr 2019 23:26:31 +0000 (UTC) X-Greylist: whitelisted by SQLgrey-1.7.6 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 1E1A985B for ; Fri, 12 Apr 2019 23:26:30 +0000 (UTC) Received: by mail-pl1-f179.google.com with SMTP id cv12so5828841plb.9 for ; Fri, 12 Apr 2019 16:26:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=Q10nYSJdpb502bz+7hCTaThapVFEvFb58qDJtIZsKgU=; b=X+ToSiQZ2DXd6OJNlb7TEU/2ruSuYrT1C2Awb62ZWb6xHl7OipWvvlMxBiM+9uFsQp Rwmmw25ZWUcjXHi/DgL1zJPf/8N1sQLrsyw8XFFEPoto1040rsgX4mgMiUbjjS8nMGA+ RRqAmsrCwPFBJ8EOhzl1xnQVDpEJzkcUnESG1pC87GQopxBdzBBfRu8jKFxpgrNpOI8f C7kNZaFLFgTJR5Rf5kVE7yLgT7xTzkBFQseQHjXc2z4BMN3aRHIXgoBc4Fry32jzOE2V /w0uYBdKj19KF4Mu8y74VE18o/aE/YIKprCiRgHfgfNs98mpuEjV2yviaCWxlBzQEGdd KNtg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=Q10nYSJdpb502bz+7hCTaThapVFEvFb58qDJtIZsKgU=; b=RGd9msTsjdcZ2KahLr6Ene/1lg196/0PFR3Q/u0nEfHQIqlnfADiB9UCFAB4GxuyEA hYyZoyfSCtOFB5ilMo53GyP4mupL9W5ueuaNr2wlEK8pyeYxlpQBzBcYUqm4FlQl7IS8 ZckHAcTKhI5JIw6mm8vNDo+Aqv5tkgHQiD4nSwEcDCnvMgGLreSnZtjtQp65/oC8Zsz5 TcBlA1Poe/5MuTBUIO5mD1ycxxTz8YACM+RdyKlhoqbTsBnoBdmKSrq7+r+2ORwFNLJ8 iVQRyhYWfcb0ckGwB/2r/0nuq/uqami1ip3AZVEPdkNMXaZzzniM4qb6GscqmKyqZUtn sV5w== X-Gm-Message-State: APjAAAUfTFzZ1D+QF3Z1uTMu4PNJUzYP9FFV6oz6CNdpCxdLkFKN4iBX I28LxHlrYQ6cJ7YywjJxSeRC1tC6 X-Google-Smtp-Source: APXvYqwZ892HtDxmo8v40M2zEE+lCo3AVN+ryuWPaj72UkIb5Obg03NAHfLhEsEvsYsXQlcQ1phA1Q== X-Received: by 2002:a17:902:be09:: with SMTP id r9mr58183026pls.215.1555111589416; Fri, 12 Apr 2019 16:26:29 -0700 (PDT) Received: from localhost.localdomain.localdomain ([216.113.160.77]) by smtp.gmail.com with ESMTPSA id o68sm105570138pfi.140.2019.04.12.16.26.28 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 12 Apr 2019 16:26:28 -0700 (PDT) From: Han Zhou X-Google-Original-From: Han Zhou To: dev@openvswitch.org Date: Fri, 12 Apr 2019 16:26:28 -0700 Message-Id: <1555111588-79659-9-git-send-email-hzhou8@ebay.com> X-Mailer: git-send-email 2.1.0 In-Reply-To: <1555111588-79659-1-git-send-email-hzhou8@ebay.com> References: <1555111588-79659-1-git-send-email-hzhou8@ebay.com> X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on smtp1.linux-foundation.org Subject: [ovs-dev] [PATCH v2 9/9] ovsdb raft: Fix duplicated transaction execution when leader failover. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: ovs-dev-bounces@openvswitch.org Errors-To: ovs-dev-bounces@openvswitch.org From: Han Zhou When a transaction is submitted from a client connected to a follower, if leader crashes after receiving the execute_command_request from the follower and sending out append request to the majority of followers, but before sending execute_command_reply to the follower. The transaction would finally got commited by the new leader. However, with current implementation the transaction would be commited twice. For the root cause, there are two cases: Case 1, the connected follower becomes the new leader. In this case, the pending command of the follower will be cancelled during its role changing to leader, so the trigger for the transaction will be retried. Case 2, another follower becomes the new leader. In this case, since there is no execute_command_reply from the original leader (which has crashed), the command will finally timed out, causing the trigger for the transaction retried. In both cases, the transaction will be retried by the server node's trigger retrying logic. This patch fixes the problem by below changes: 1) A pending command can be completed not only by execute_command_reply, but also when the eid is committed, if the execute_command_reply never came. 2) Instead of cancelling all pending commands during role change, let the commands continue waiting to be completed when the eid is committed. The timer is increased to be twice the election base time, so that it has the chance to be completed when leader crashes. This patch fixes the two raft failure test cases previously disabled. See the test case for details of how to reproduce the problem. Signed-off-by: Han Zhou --- ovsdb/raft.c | 76 ++++++++++++++++++++++++++++++++------------------ tests/ovsdb-cluster.at | 4 --- 2 files changed, 49 insertions(+), 31 deletions(-) diff --git a/ovsdb/raft.c b/ovsdb/raft.c index 02bdf11..77ad365 100644 --- a/ovsdb/raft.c +++ b/ovsdb/raft.c @@ -1607,10 +1607,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer) return; } - raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP); - ovs_assert(raft->role != RAFT_LEADER); - ovs_assert(hmap_is_empty(&raft->commands)); raft->role = RAFT_CANDIDATE; raft->n_votes = 0; @@ -1793,17 +1790,22 @@ raft_run(struct raft *raft) } } - if (time_msec() >= raft->ping_timeout) { + long long int now = time_msec(); + if (now >= raft->ping_timeout) { if (raft->role == RAFT_LEADER) { raft_send_heartbeats(raft); - } else { - long long int now = time_msec(); - struct raft_command *cmd, *next_cmd; - HMAP_FOR_EACH_SAFE (cmd, next_cmd, hmap_node, &raft->commands) { - if (cmd->timestamp - && now - cmd->timestamp > ELECTION_BASE_MSEC) { - raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT); - } + } + /* Check if any commands timeout. Timeout is set to twice the time of + * election base time so that commands can complete properly during + * leader election. E.g. a leader crashed and current node with pending + * commands becomes new leader: the pending commands can still complete + * if the crashed leader has replicated the transactions to majority of + * followers before it crashed. */ + struct raft_command *cmd, *next_cmd; + HMAP_FOR_EACH_SAFE (cmd, next_cmd, hmap_node, &raft->commands) { + if (cmd->timestamp + && now - cmd->timestamp > ELECTION_BASE_MSEC * 2) { + raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT); } raft_reset_ping_timer(raft); } @@ -1974,6 +1976,7 @@ raft_command_initiate(struct raft *raft, struct raft_command *cmd = raft_command_create_incomplete(raft, index); ovs_assert(eid); cmd->eid = *eid; + cmd->timestamp = time_msec(); raft_waiter_create(raft, RAFT_W_ENTRY, true)->entry.index = cmd->index; @@ -1996,6 +1999,15 @@ raft_command_initiate(struct raft *raft, return cmd; } +static void +log_all_commands(struct raft *raft) +{ + struct raft_command *cmd, *next; + HMAP_FOR_EACH_SAFE (cmd, next, hmap_node, &raft->commands) { + VLOG_DBG("raft command eid: "UUID_FMT, UUID_ARGS(&cmd->eid)); + } +} + static struct raft_command * OVS_WARN_UNUSED_RESULT raft_command_execute__(struct raft *raft, const struct json *data, const struct json *servers, @@ -2051,6 +2063,7 @@ raft_command_execute__(struct raft *raft, struct raft_command *cmd = raft_command_create_incomplete(raft, 0); cmd->timestamp = time_msec(); cmd->eid = eid; + log_all_commands(raft); return cmd; } @@ -2126,6 +2139,8 @@ raft_command_complete(struct raft *raft, struct raft_command *cmd, enum raft_command_status status) { + VLOG_DBG("raft_command_complete eid "UUID_FMT" status: %s", + UUID_ARGS(&cmd->eid), raft_command_status_to_string(status)); if (!uuid_is_zero(&cmd->sid)) { uint64_t commit_index = status == RAFT_CMD_SUCCESS ? cmd->index : 0; raft_send_execute_command_reply(raft, &cmd->sid, &cmd->eid, status, @@ -2149,19 +2164,6 @@ raft_complete_all_commands(struct raft *raft, enum raft_command_status status) } static struct raft_command * -raft_find_command_by_index(struct raft *raft, uint64_t index) -{ - struct raft_command *cmd; - - HMAP_FOR_EACH_IN_BUCKET (cmd, hmap_node, index, &raft->commands) { - if (cmd->index == index) { - return cmd; - } - } - return NULL; -} - -static struct raft_command * raft_find_command_by_eid(struct raft *raft, const struct uuid *eid) { struct raft_command *cmd; @@ -2441,7 +2443,7 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s) static void raft_become_leader(struct raft *raft) { - raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP); + log_all_commands(raft); static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); VLOG_INFO_RL(&rl, "term %"PRIu64": elected leader by %d+ of " @@ -2627,8 +2629,14 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index) const struct raft_entry *e = raft_get_entry(raft, index); if (e->data) { struct raft_command *cmd - = raft_find_command_by_index(raft, index); + = raft_find_command_by_eid(raft, &e->eid); if (cmd) { + if (!cmd->index) { + VLOG_DBG("Command completed after role change from" + " follower to leader "UUID_FMT, + UUID_ARGS(&e->eid)); + cmd->index = index; + } raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS); } } @@ -2641,6 +2649,20 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index) } } else { raft->commit_index = new_commit_index; + /* Check if any pending command can be completed, and complete it. + * This can happen when leader fail-over before sending + * execute_command_reply. */ + const struct uuid *eid = raft_get_eid(raft, new_commit_index); + struct raft_command *cmd = raft_find_command_by_eid(raft, eid); + if (cmd) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5); + VLOG_INFO_RL(&rl, + "Command completed without reply (eid: "UUID_FMT", " + "commit index: %"PRIu64")", + UUID_ARGS(eid), new_commit_index); + cmd->index = new_commit_index; + raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS); + } } /* Write the commit index to the log. The next time we restart, this diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at index 4e88766..931da6e 100644 --- a/tests/ovsdb-cluster.at +++ b/tests/ovsdb-cluster.at @@ -138,8 +138,6 @@ AT_BANNER([OVSDB - cluster failure with pending transaction]) AT_SETUP([OVSDB cluster - txn on follower-2, leader crash before sending appendReq, follower-2 becomes leader]) AT_KEYWORDS([ovsdb server negative unix cluster pending-txn]) -# XXX: fix bug before enabling this test -AT_CHECK([exit 77]) ovsdb_cluster_failure_test 2 3 1 crash-before-sending-append-request 2 AT_CLEANUP @@ -157,8 +155,6 @@ AT_CLEANUP AT_SETUP([OVSDB cluster - txn on follower-2, leader crash before sending execRep, follower-3 becomes leader]) AT_KEYWORDS([ovsdb server negative unix cluster pending-txn]) -# XXX: fix bug before enabling this test -AT_CHECK([exit 77]) ovsdb_cluster_failure_test 2 3 1 crash-before-sending-execute-command-reply 3 AT_CLEANUP