[ovs-dev] docs: Document manual cluster recovery procedure.

Message ID 20240412023143.4157646-1-ihrachys@redhat.com
State Superseded, archived
Delegated to: Simon Horman
Series [ovs-dev] docs: Document manual cluster recovery procedure.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/intel-ovs-compilation success test: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed

Commit Message

Ihar Hrachyshka April 12, 2024, 2:31 a.m. UTC
Remove the notion of cluster/leave --force since it was never
implemented. Instead of these instructions, document how a broken
cluster can be re-initialized with the old database contents.

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
---
 Documentation/ref/ovsdb.7.rst | 50 +++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 8 deletions(-)

Comments

Simon Horman April 12, 2024, 9:35 a.m. UTC | #1
On Fri, Apr 12, 2024 at 02:31:43AM +0000, Ihar Hrachyshka wrote:
> Remove the notion of cluster/leave --force since it was never
> implemented. Instead of these instructions, document how a broken
> cluster can be re-initialized with the old database contents.
> 
> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>

Acked-by: Simon Horman <horms@ovn.org>
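
For reference, the recovery procedure the patch documents can be sketched as a
shell session. This is a sketch only: the database name (``OVN_Northbound``),
file paths, and ``tcp:`` addresses below are illustrative assumptions, not
taken from the patch.

```shell
# Manual cluster recovery sketch for a hypothetical OVN_Northbound cluster.
# All paths and tcp: addresses are examples; adjust for your deployment.
# All old ovsdb-server instances are assumed stopped at this point.

# On the chosen bootstrap member, convert its clustered database file
# to standalone format.
ovsdb-tool cluster-to-standalone /tmp/nb_standalone.db /etc/ovn/ovnnb_db.db

# Keep a backup of the standalone file; it seeds the new cluster.
cp /tmp/nb_standalone.db /tmp/nb_standalone.db.bak

# Move the broken clustered file aside and re-initialize a fresh
# single-member cluster from the standalone data.
mv /etc/ovn/ovnnb_db.db /etc/ovn/ovnnb_db.db.broken
ovsdb-tool create-cluster /etc/ovn/ovnnb_db.db /tmp/nb_standalone.db \
    tcp:10.0.0.1:6643

# Start ovsdb-server on the bootstrap member and verify it serves the
# restored data.  Then, on each remaining member, create a joining copy
# pointing at the bootstrap member and start ovsdb-server there as usual:
ovsdb-tool join-cluster /etc/ovn/ovnnb_db.db OVN_Northbound \
    tcp:10.0.0.2:6643 tcp:10.0.0.1:6643
```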

Patch

diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
index 46ed13e61..5882643a0 100644
--- a/Documentation/ref/ovsdb.7.rst
+++ b/Documentation/ref/ovsdb.7.rst
@@ -315,16 +315,11 @@  The above methods for adding and removing servers only work for healthy
 clusters, that is, for clusters with no more failures than their maximum
 tolerance.  For example, in a 3-server cluster, the failure of 2 servers
 prevents servers joining or leaving the cluster (as well as database access).
+
 To prevent data loss or inconsistency, the preferred solution to this problem
 is to bring up enough of the failed servers to make the cluster healthy again,
-then if necessary remove any remaining failed servers and add new ones.  If
-this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave
---force`` on a running server.  This command forces the server to which it is
-directed to leave its cluster and form a new single-node cluster that contains
-only itself.  The data in the new cluster may be inconsistent with the former
-cluster: transactions not yet replicated to the server will be lost, and
-transactions not yet applied to the cluster may be committed.  Afterward, any
-servers in its former cluster will regard the server to have failed.
+then if necessary remove any remaining failed servers and add new ones.  If
+this is not an option, see the next section for the manual recovery procedure.
 
 Once a server leaves a cluster, it may never rejoin it.  Instead, create a new
 server and join it to the cluster.
@@ -362,6 +357,45 @@  Clustered OVSDB does not support the OVSDB "ephemeral columns" feature.
 ones when they work with schemas for clustered databases.  Future versions of
 OVSDB might add support for this feature.
 
+Manual cluster recovery
+~~~~~~~~~~~~~~~~~~~~~~~
+
+If removing failed members and rejoining them to the existing cluster is not
+an option in your environment, you may consider recovering the cluster
+manually, as follows.
+
+*Important*: The procedure below will result in a change of ``cid`` and
+``sid``.  Afterward, any servers in the former cluster will regard the
+recovered server as failed.
+
+If you understand the risks and are still willing to proceed, then:
+
+1. Stop the old cluster ``ovsdb-server`` instances before proceeding.
+
+2. Pick one of the old members which will serve as the bootstrap member of the
+   to-be-recovered cluster.
+
+3. Convert its database file to standalone format using ``ovsdb-tool
+   cluster-to-standalone``.
+
+4. Back up the standalone database file.  You will use it in the next step.
+
+5. Re-initialize the new cluster with the bootstrap member (``ovsdb-tool
+   create-cluster``) using the previously saved database file.
+
+6. Start the bootstrapped cluster with this new member.
+
+Once you have confirmed that the single-member cluster is up and running and
+serves the restored data, you may proceed with joining the rest of the members
+to the newly formed cluster, as usual (``ovsdb-tool join-cluster``).
+
+Once the cluster is restored, any active clients will have to reconnect to the
+new cluster.
+
+Note: The data in the new cluster may be inconsistent with the former cluster:
+transactions not yet replicated to the server will be lost, and transactions
+not yet applied to the cluster may be committed.
+
 Upgrading from version 2.14 and earlier to 2.15 and later
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~