[ovs-dev,v2] docs: Document manual cluster recovery procedure.

Message ID: 20240412131226.4162752-1-ihrachys@redhat.com
State: Changes Requested

Checks

Context                                 Check    Description
ovsrobot/apply-robot                    success  apply and check: success
ovsrobot/github-robot-_Build_and_Test   success  github build: passed
ovsrobot/intel-ovs-compilation          success  test: success

Commit Message

Ihar Hrachyshka April 12, 2024, 1:12 p.m. UTC
Remove the notion of cluster/leave --force since it was never
implemented. Instead of these instructions, document how a broken
cluster can be re-initialized with the old database contents.

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>

---

v1: initial version.
v2: remove --force mentioned in ovsdb-server(1).

---
 Documentation/ref/ovsdb.7.rst | 50 +++++++++++++++++++++++++++++------
 ovsdb/ovsdb-server.1.in       |  3 +--
 2 files changed, 43 insertions(+), 10 deletions(-)

Comments

Ilya Maximets April 12, 2024, 2:07 p.m. UTC | #1
On 4/12/24 15:12, Ihar Hrachyshka wrote:
> Remove the notion of cluster/leave --force since it was never
> implemented. Instead of these instructions, document how a broken
> cluster can be re-initialized with the old database contents.
> 
> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>

Hi, Ihar.  Thanks for cleaning this up!

See some comments inline.

Best regards, Ilya Maximets.

> 
> ---
> 
> v1: initial version.
> v2: remove --force mentioned in ovsdb-server(1).
> 
> ---
>  Documentation/ref/ovsdb.7.rst | 50 +++++++++++++++++++++++++++++------
>  ovsdb/ovsdb-server.1.in       |  3 +--
>  2 files changed, 43 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
> index 46ed13e61..5882643a0 100644
> --- a/Documentation/ref/ovsdb.7.rst
> +++ b/Documentation/ref/ovsdb.7.rst
> @@ -315,16 +315,11 @@ The above methods for adding and removing servers only work for healthy
>  clusters, that is, for clusters with no more failures than their maximum
>  tolerance.  For example, in a 3-server cluster, the failure of 2 servers
>  prevents servers joining or leaving the cluster (as well as database access).
> +
>  To prevent data loss or inconsistency, the preferred solution to this problem
>  is to bring up enough of the failed servers to make the cluster healthy again,
> -then if necessary remove any remaining failed servers and add new ones.  If
> -this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave
> ---force`` on a running server.  This command forces the server to which it is
> -directed to leave its cluster and form a new single-node cluster that contains
> -only itself.  The data in the new cluster may be inconsistent with the former
> -cluster: transactions not yet replicated to the server will be lost, and
> -transactions not yet applied to the cluster may be committed.  Afterward, any
> -servers in its former cluster will regard the server to have failed.
> +then if necessary remove any remaining failed servers and add new ones. If this
> +is not an option, see the next section for manual recovery procedure.

The link to the section should be used here:

... see the next section for `Manual cluster recovery`_.

>  
>  Once a server leaves a cluster, it may never rejoin it.  Instead, create a new
>  server and join it to the cluster.
> @@ -362,6 +357,45 @@ Clustered OVSDB does not support the OVSDB "ephemeral columns" feature.
>  ones when they work with schemas for clustered databases.  Future versions of
>  OVSDB might add support for this feature.
>  
> +Manual cluster recovery
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +If kicking and rejoining failed members to the existing cluster is not
> +available in your environment, you may consider to recover the cluster

Please avoid personal pronouns like 'your/you' in documentation.  Documentation
should describe the process or functionality; it should not generally talk to
the reader.  For example, this part can be:

"""
If kicking and rejoining failed members to the existing cluster is not
available, it is possible to recover the cluster manually, as follows.
"""

Also, a kicked server can't re-join without following this manual recovery
procedure, so this sentence is probably not something we should say.

> +manually, as follows.
> +
> +*Important*: The procedure below will result in `cid` and `sid` change.

It should be in '.. important::' section instead.
Also, double-quote cid and sid.

> +Afterward, any servers in the former cluster will regard the recovered server
> +failed.

There will be no former cluster; the procedure below asks to stop all
the members...

> +
> +If you understand the risks and are still willing to proceed, then:

Would be nice to re-phrase this to not have 'you'.

> +
> +1. Stop the old cluster ``ovsdb-server`` instances before proceeding.

Might make sense to clarify that all servers should be stopped.
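
For illustration, each member can be stopped through its control socket;
the socket path below is a made-up example and varies by deployment:

  $ ovs-appctl -t /var/run/ovn/ovnnb_db.ctl exit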

> +
> +2. Pick one of the old members which will serve as the bootstrap member of the

'as a bootstrap' ?

> +   to-be-recovered cluster.
> +
> +3. Convert its database file to standalone format using ``ovsdb-tool

'to a standalone format' ?

> +   cluster-to-standalone``.
> +
> +4. Backup the standalone database file. You will use the file in the next step.

The second sentence can be removed, I think.

> +
> +5. Re-initialize the new cluster with the bootstrap member (``ovsdb-tool
> +   create-cluster``) using the previously saved database file.

Maybe "Create a new single-node cluster with ``ovsdb-tool create-cluster``
using using the previously saved standalone database file, then strart
``ovsdb-server``" ?
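
As a rough sketch of steps 3-5, with made-up file names and a made-up
local address:

  $ ovsdb-tool cluster-to-standalone standalone.db ovnnb_db.db   # step 3
  $ cp standalone.db standalone.db.backup                        # step 4
  $ ovsdb-tool create-cluster new-ovnnb_db.db standalone.db \
        tcp:10.0.0.1:6643                                        # step 5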

> +
> +6. Start the bootstrapped cluster with this new member.

This is a little confusing.  Should probably be removed.

> +
> +Once you confirmed that the single member cluster is up and running and serves
> +the restored data, you may proceed with joining the rest of the members to the
> +newly formed cluster, as usual (``ovsdb-tool join-cluster``).

'you'

We use 'single-node' in other places in the doc, not 'single member'.

'joining the rest of the members' is confusing.  We're not using anything from
the rest of the nodes.  New members should be created.
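
A sketch of creating one fresh member afterwards, with made-up file names,
schema name, and addresses (then start ``ovsdb-server`` on the new file
as usual):

  $ ovsdb-tool join-cluster ovnnb_db.db OVN_Northbound \
        tcp:10.0.0.2:6643 tcp:10.0.0.1:6643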

> +
> +Once the cluster is restored, any active clients will have to reconnect to the
> +new cluster.

The cluster was stopped in step 1...

> +
> +Note: The data in the new cluster may be inconsistent with the former cluster:

.. note::

> +transactions not yet replicated to the server will be lost, and transactions

"to the server chosen on step 2"

> +not yet applied to the cluster may be committed.
> +
>  Upgrading from version 2.14 and earlier to 2.15 and later
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> diff --git a/ovsdb/ovsdb-server.1.in b/ovsdb/ovsdb-server.1.in
> index 9fabf2d67..23b8e6e9c 100644
> --- a/ovsdb/ovsdb-server.1.in
> +++ b/ovsdb/ovsdb-server.1.in
> @@ -461,8 +461,7 @@ This does not result in a three server cluster that lacks quorum.
>  .
>  .IP "\fBcluster/kick \fIdb server\fR"
>  Start graceful removal of \fIserver\fR from \fIdb\fR's cluster, like
> -\fBcluster/leave\fR (without \fB\-\-force\fR) except that it can
> -remove any server, not just this one.
> +\fBcluster/leave\fR, except that it can remove any server, not just this one.
>  .IP
>  \fIserver\fR may be a server ID, as printed by \fBcluster/sid\fR, or
>  the server's local network address as passed to \fBovsdb-tool\fR's
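
For reference, a hypothetical cluster/kick invocation; the socket path,
database name, and server address are made-up examples:

  $ ovs-appctl -t /var/run/ovn/ovnnb_db.ctl \
        cluster/kick OVN_Northbound tcp:10.0.0.3:6643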