
[ovs-dev,v3] docs: Document manual cluster recovery procedure.

Message ID: 20240426165448.42125-1-ihrachys@redhat.com
State: Accepted
Commit: 01a0fff36104790640e274f1d457084aeb5b968d
Delegated to: Ilya Maximets
Series [ovs-dev,v3] docs: Document manual cluster recovery procedure.

Checks

Context                         Check    Description
ovsrobot/apply-robot            success  apply and check: success
ovsrobot/intel-ovs-compilation  success  test: success

Commit Message

Ihar Hrachyshka April 26, 2024, 4:54 p.m. UTC
Remove the notion of cluster/leave --force since it was never
implemented. Instead of these instructions, document how a broken
cluster can be re-initialized with the old database contents.

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>

---

v1: initial version.
v2: remove --force mentioned in ovsdb-server(1).
v3: multiple language and markup changes suggested by Ilya.

---
 Documentation/ref/ovsdb.7.rst | 44 ++++++++++++++++++++++++++++-------
 ovsdb/ovsdb-server.1.in       |  3 +--
 2 files changed, 37 insertions(+), 10 deletions(-)

Comments

Simon Horman May 1, 2024, 10:46 a.m. UTC | #1
On Fri, Apr 26, 2024 at 04:54:48PM +0000, Ihar Hrachyshka wrote:
> Remove the notion of cluster/leave --force since it was never
> implemented. Instead of these instructions, document how a broken
> cluster can be re-initialized with the old database contents.
> 
> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
> 
> ---
> 
> v1: initial version.
> v2: remove --force mentioned in ovsdb-server(1).
> v3: multiple language and markup changes suggested by Ilya.

Thanks for the updates Ihar, this version looks good to me.

Acked-by: Simon Horman <horms@ovn.org>

...
Ilya Maximets May 2, 2024, 9:52 p.m. UTC | #2
On 4/26/24 18:54, Ihar Hrachyshka wrote:
> Remove the notion of cluster/leave --force since it was never
> implemented. Instead of these instructions, document how a broken
> cluster can be re-initialized with the old database contents.
> 
> Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
> 
> ---
> 
> v1: initial version.
> v2: remove --force mentioned in ovsdb-server(1).
> v3: multiple language and markup changes suggested by Ilya.

Thanks, Ihar!  This version looks good to me in general.
I have a couple of minor nits below.  If you agree, I can
fold those in while applying the change.

Let me know what you think.

Best regards, Ilya Maximets.

> 
> ---
>  Documentation/ref/ovsdb.7.rst | 44 ++++++++++++++++++++++++++++-------
>  ovsdb/ovsdb-server.1.in       |  3 +--
>  2 files changed, 37 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
> index 46ed13e61..5766e64b9 100644
> --- a/Documentation/ref/ovsdb.7.rst
> +++ b/Documentation/ref/ovsdb.7.rst
> @@ -315,16 +315,11 @@ The above methods for adding and removing servers only work for healthy
>  clusters, that is, for clusters with no more failures than their maximum
>  tolerance.  For example, in a 3-server cluster, the failure of 2 servers
>  prevents servers joining or leaving the cluster (as well as database access).
> +
>  To prevent data loss or inconsistency, the preferred solution to this problem
>  is to bring up enough of the failed servers to make the cluster healthy again,
> -then if necessary remove any remaining failed servers and add new ones.  If
> -this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave
> ---force`` on a running server.  This command forces the server to which it is
> -directed to leave its cluster and form a new single-node cluster that contains
> -only itself.  The data in the new cluster may be inconsistent with the former
> -cluster: transactions not yet replicated to the server will be lost, and
> -transactions not yet applied to the cluster may be committed.  Afterward, any
> -servers in its former cluster will regard the server to have failed.
> +then if necessary remove any remaining failed servers and add new ones. If this

Nit:  2 spaces between sentences.

> +is not an option, see the next section for `Manual cluster recovery`_.
>  
>  Once a server leaves a cluster, it may never rejoin it.  Instead, create a new
>  server and join it to the cluster.
> @@ -362,6 +357,39 @@ Clustered OVSDB does not support the OVSDB "ephemeral columns" feature.
>  ones when they work with schemas for clustered databases.  Future versions of
>  OVSDB might add support for this feature.
>  
> +Manual cluster recovery
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. important::

Nit: An empty line here would be nice to be consistent at least
     within this document.

> +   The procedure below will result in ``cid`` and ``sid`` change. A *new*

Nit:  2 spaces between sentences.

> +   cluster will be initialized.
> +
> +To recover a clustered database after a failure:
> +
> +1. Stop *all* old cluster ``ovsdb-server`` instances before proceeding.
> +
> +2. Pick one of the old members which will serve as a bootstrap member of the
> +   to-be-recovered cluster.
> +
> +3. Convert its database file to the standalone format using ``ovsdb-tool
> +   cluster-to-standalone``.
> +
> +4. Backup the standalone database file.
> +
> +5. Create a new single-node cluster with ``ovsdb-tool create-cluster``
> +   using the previously saved standalone database file, then start
> +   ``ovsdb-server``.
> +
> +Once the single-node cluster is up and running and serves the restored data,
> +new members should be created and join the new cluster, as usual (``ovsdb-tool
> +join-cluster``).

I'm having a hard time reading 'new members should be created and join' as
my brain wants to relate 'should be' to both 'created' and 'join' and
'should be join' is not a correct construct.

How about: "new members should be created and added to the cluster, as usual,
with ``ovsdb-tool join-cluster``."  ?

Also, should it be a step 6 ?

> +
> +.. note::
> +
> +   The data in the new cluster may be inconsistent with the former cluster:
> +   transactions not yet replicated to the server chosen in step 2 will be lost,
> +   and transactions not yet applied to the cluster may be committed.
> +
>  Upgrading from version 2.14 and earlier to 2.15 and later
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> diff --git a/ovsdb/ovsdb-server.1.in b/ovsdb/ovsdb-server.1.in
> index 9fabf2d67..23b8e6e9c 100644
> --- a/ovsdb/ovsdb-server.1.in
> +++ b/ovsdb/ovsdb-server.1.in
> @@ -461,8 +461,7 @@ This does not result in a three server cluster that lacks quorum.
>  .
>  .IP "\fBcluster/kick \fIdb server\fR"
>  Start graceful removal of \fIserver\fR from \fIdb\fR's cluster, like
> -\fBcluster/leave\fR (without \fB\-\-force\fR) except that it can
> -remove any server, not just this one.
> +\fBcluster/leave\fR, except that it can remove any server, not just this one.
>  .IP
>  \fIserver\fR may be a server ID, as printed by \fBcluster/sid\fR, or
>  the server's local network address as passed to \fBovsdb-tool\fR's
Ihar Hrachyshka May 2, 2024, 10:42 p.m. UTC | #3
On Thu, May 2, 2024 at 5:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 4/26/24 18:54, Ihar Hrachyshka wrote:
> > Remove the notion of cluster/leave --force since it was never
> > implemented. Instead of these instructions, document how a broken
> > cluster can be re-initialized with the old database contents.
> >
> > Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
> >
> > ---
> >
> > v1: initial version.
> > v2: remove --force mentioned in ovsdb-server(1).
> > v3: multiple language and markup changes suggested by Ilya.
>
> Thanks, Ihar!  This version looks good to me in general.
> I have a couple of minor nits below.  If you agree, I can
> fold those in while applying the change.
>

Feel free to. And thanks for your patience.


>
> Let me know what you think.
>
> Best regards, Ilya Maximets.
>
> >
> > ---
> >  Documentation/ref/ovsdb.7.rst | 44 ++++++++++++++++++++++++++++-------
> >  ovsdb/ovsdb-server.1.in       |  3 +--
> >  2 files changed, 37 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/ref/ovsdb.7.rst
> b/Documentation/ref/ovsdb.7.rst
> > index 46ed13e61..5766e64b9 100644
> > --- a/Documentation/ref/ovsdb.7.rst
> > +++ b/Documentation/ref/ovsdb.7.rst
> > @@ -315,16 +315,11 @@ The above methods for adding and removing servers
> only work for healthy
> >  clusters, that is, for clusters with no more failures than their maximum
> >  tolerance.  For example, in a 3-server cluster, the failure of 2 servers
> >  prevents servers joining or leaving the cluster (as well as database
> access).
> > +
> >  To prevent data loss or inconsistency, the preferred solution to this
> problem
> >  is to bring up enough of the failed servers to make the cluster healthy
> again,
> > -then if necessary remove any remaining failed servers and add new
> ones.  If
> > -this cannot be done, though, use ``ovs-appctl`` to invoke
> ``cluster/leave
> > ---force`` on a running server.  This command forces the server to which
> it is
> > -directed to leave its cluster and form a new single-node cluster that
> contains
> > -only itself.  The data in the new cluster may be inconsistent with the
> former
> > -cluster: transactions not yet replicated to the server will be lost, and
> > -transactions not yet applied to the cluster may be committed.
> Afterward, any
> > -servers in its former cluster will regard the server to have failed.
> > +then if necessary remove any remaining failed servers and add new ones.
> If this
>
> Nit:  2 spaces between sentences.
>
> > +is not an option, see the next section for `Manual cluster recovery`_.
> >
> >  Once a server leaves a cluster, it may never rejoin it.  Instead,
> create a new
> >  server and join it to the cluster.
> > @@ -362,6 +357,39 @@ Clustered OVSDB does not support the OVSDB
> "ephemeral columns" feature.
> >  ones when they work with schemas for clustered databases.  Future
> versions of
> >  OVSDB might add support for this feature.
> >
> > +Manual cluster recovery
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +.. important::
>
> Nit: An empty line here would be nice to be consistent at least
>      within this document.
>
> > +   The procedure below will result in ``cid`` and ``sid`` change. A
> *new*
>
> Nit:  2 spaces between sentences.
>
> > +   cluster will be initialized.
> > +
> > +To recover a clustered database after a failure:
> > +
> > +1. Stop *all* old cluster ``ovsdb-server`` instances before proceeding.
> > +
> > +2. Pick one of the old members which will serve as a bootstrap member
> of the
> > +   to-be-recovered cluster.
> > +
> > +3. Convert its database file to the standalone format using ``ovsdb-tool
> > +   cluster-to-standalone``.
> > +
> > +4. Backup the standalone database file.
> > +
> > +5. Create a new single-node cluster with ``ovsdb-tool create-cluster``
> > +   using the previously saved standalone database file, then start
> > +   ``ovsdb-server``.
> > +
> > +Once the single-node cluster is up and running and serves the restored
> data,
> > +new members should be created and join the new cluster, as usual
> (``ovsdb-tool
> > +join-cluster``).
>
> I'm having a hard time reading 'new members should be created and join' as
> my brain wants to relate 'should be' to both 'created' and 'join' and
> 'should be join' is not a correct construct.
>
> How about: "new members should be created and added to the cluster, as
> usual,
> with ``ovsdb-tool join-cluster``."  ?
>

Though it doesn't confuse me, I am not a native speaker, and I find your
version at least as good as mine, so feel free to change.


>
> Also, should it be a step 6 ?
>
>
It won't hurt to fold it into the list.


> > +
> > +.. note::
> > +
> > +   The data in the new cluster may be inconsistent with the former
> cluster:
> > +   transactions not yet replicated to the server chosen in step 2 will
> be lost,
> > +   and transactions not yet applied to the cluster may be committed.
> > +
> >  Upgrading from version 2.14 and earlier to 2.15 and later
> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > diff --git a/ovsdb/ovsdb-server.1.in b/ovsdb/ovsdb-server.1.in
> > index 9fabf2d67..23b8e6e9c 100644
> > --- a/ovsdb/ovsdb-server.1.in
> > +++ b/ovsdb/ovsdb-server.1.in
> > @@ -461,8 +461,7 @@ This does not result in a three server cluster that
> lacks quorum.
> >  .
> >  .IP "\fBcluster/kick \fIdb server\fR"
> >  Start graceful removal of \fIserver\fR from \fIdb\fR's cluster, like
> > -\fBcluster/leave\fR (without \fB\-\-force\fR) except that it can
> > -remove any server, not just this one.
> > +\fBcluster/leave\fR, except that it can remove any server, not just
> this one.
> >  .IP
> >  \fIserver\fR may be a server ID, as printed by \fBcluster/sid\fR, or
> >  the server's local network address as passed to \fBovsdb-tool\fR's
>
>
Ilya Maximets May 3, 2024, 2:10 p.m. UTC | #4
On 5/3/24 00:42, Ihar Hrachyshka wrote:
> On Thu, May 2, 2024 at 5:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> 
>     On 4/26/24 18:54, Ihar Hrachyshka wrote:
>     > Remove the notion of cluster/leave --force since it was never
>     > implemented. Instead of these instructions, document how a broken
>     > cluster can be re-initialized with the old database contents.
>     >
>     > Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
>     >
>     > ---
>     >
>     > v1: initial version.
>     > v2: remove --force mentioned in ovsdb-server(1).
>     > v3: multiple language and markup changes suggested by Ilya.
> 
>     Thanks, Ihar!  This version looks good to me in general.
>     I have a couple of minor nits below.  If you agree, I can
>     fold those in while applying the change.
> 
> 
> Feel free to. And thanks for your patience.

Thanks, Ihar and Simon!  I made the discussed changes and applied the patch.

Best regards, Ilya Maximets.

Patch

diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
index 46ed13e61..5766e64b9 100644
--- a/Documentation/ref/ovsdb.7.rst
+++ b/Documentation/ref/ovsdb.7.rst
@@ -315,16 +315,11 @@  The above methods for adding and removing servers only work for healthy
 clusters, that is, for clusters with no more failures than their maximum
 tolerance.  For example, in a 3-server cluster, the failure of 2 servers
 prevents servers joining or leaving the cluster (as well as database access).
+
 To prevent data loss or inconsistency, the preferred solution to this problem
 is to bring up enough of the failed servers to make the cluster healthy again,
-then if necessary remove any remaining failed servers and add new ones.  If
-this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave
---force`` on a running server.  This command forces the server to which it is
-directed to leave its cluster and form a new single-node cluster that contains
-only itself.  The data in the new cluster may be inconsistent with the former
-cluster: transactions not yet replicated to the server will be lost, and
-transactions not yet applied to the cluster may be committed.  Afterward, any
-servers in its former cluster will regard the server to have failed.
+then if necessary remove any remaining failed servers and add new ones. If this
+is not an option, see the next section for `Manual cluster recovery`_.
 
 Once a server leaves a cluster, it may never rejoin it.  Instead, create a new
 server and join it to the cluster.
@@ -362,6 +357,39 @@  Clustered OVSDB does not support the OVSDB "ephemeral columns" feature.
 ones when they work with schemas for clustered databases.  Future versions of
 OVSDB might add support for this feature.
 
+Manual cluster recovery
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. important::
+   The procedure below will result in ``cid`` and ``sid`` change. A *new*
+   cluster will be initialized.
+
+To recover a clustered database after a failure:
+
+1. Stop *all* old cluster ``ovsdb-server`` instances before proceeding.
+
+2. Pick one of the old members which will serve as a bootstrap member of the
+   to-be-recovered cluster.
+
+3. Convert its database file to the standalone format using ``ovsdb-tool
+   cluster-to-standalone``.
+
+4. Backup the standalone database file.
+
+5. Create a new single-node cluster with ``ovsdb-tool create-cluster``
+   using the previously saved standalone database file, then start
+   ``ovsdb-server``.
+
+Once the single-node cluster is up and running and serves the restored data,
+new members should be created and join the new cluster, as usual (``ovsdb-tool
+join-cluster``).
+
+.. note::
+
+   The data in the new cluster may be inconsistent with the former cluster:
+   transactions not yet replicated to the server chosen in step 2 will be lost,
+   and transactions not yet applied to the cluster may be committed.
+
 Upgrading from version 2.14 and earlier to 2.15 and later
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/ovsdb/ovsdb-server.1.in b/ovsdb/ovsdb-server.1.in
index 9fabf2d67..23b8e6e9c 100644
--- a/ovsdb/ovsdb-server.1.in
+++ b/ovsdb/ovsdb-server.1.in
@@ -461,8 +461,7 @@  This does not result in a three server cluster that lacks quorum.
 .
 .IP "\fBcluster/kick \fIdb server\fR"
 Start graceful removal of \fIserver\fR from \fIdb\fR's cluster, like
-\fBcluster/leave\fR (without \fB\-\-force\fR) except that it can
-remove any server, not just this one.
+\fBcluster/leave\fR, except that it can remove any server, not just this one.
 .IP
 \fIserver\fR may be a server ID, as printed by \fBcluster/sid\fR, or
 the server's local network address as passed to \fBovsdb-tool\fR's
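
For readers who want to see the recovery steps from the new "Manual cluster recovery" section as concrete commands, the following is a minimal sketch and not part of the patch. It only builds and prints the command lines; nothing is executed. All file paths, the addresses, and the ``OVN_Northbound`` schema name are hypothetical placeholders. The argument order follows my reading of ovsdb-tool(1) (``cluster-to-standalone db clusterdb``, ``create-cluster db contents local``, ``join-cluster db name local remote...``) and should be checked against the installed version before use.

```python
import shlex


def recovery_plan(old_db, standalone_db, backup_db, new_db, local_addr):
    """Return argv lists for steps 3-5 of the procedure (nothing is run).

    Assumes steps 1-2 are already done: every old ovsdb-server is stopped
    and old_db is the clustered file of the chosen bootstrap member.
    new_db is assumed to be a path that does not exist yet.
    """
    return [
        # Step 3: convert the clustered file to the standalone format.
        ["ovsdb-tool", "cluster-to-standalone", standalone_db, old_db],
        # Step 4: back up the standalone file (plain copy, as an example).
        ["cp", standalone_db, backup_db],
        # Step 5: create a new single-node cluster from the standalone data.
        ["ovsdb-tool", "create-cluster", new_db, standalone_db, local_addr],
    ]


def join_command(member_db, schema_name, local_addr, remote_addr):
    """Return the argv for adding one new member once the new cluster is up."""
    return ["ovsdb-tool", "join-cluster", member_db, schema_name,
            local_addr, remote_addr]


if __name__ == "__main__":
    # Hypothetical paths and addresses, for illustration only.
    plan = recovery_plan(
        old_db="/tmp/old-cluster.db",
        standalone_db="/tmp/standalone.db",
        backup_db="/tmp/standalone.db.bak",
        new_db="/tmp/new-cluster.db",
        local_addr="tcp:10.0.0.1:6644",
    )
    plan.append(join_command("/tmp/member2.db", "OVN_Northbound",
                             "tcp:10.0.0.2:6644", "tcp:10.0.0.1:6644"))
    for argv in plan:
        print(shlex.join(argv))
```

Printing the plan rather than running it keeps the sketch safe to try on any machine. Starting ``ovsdb-server`` on the new file (the rest of step 5) is left out because its options depend on the deployment.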