[ovs-dev] OVN resource agent - make promotion synchronous
diff mbox series

Message ID 20190709070257.11955-1-michele@acksyn.org
State Accepted
Headers show
Series
  • [ovs-dev] OVN resource agent - make promotion synchronous
Related show

Commit Message

Michele Baldessari July 9, 2019, 7:02 a.m. UTC
Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
and 'promote_ovnsb' and then just record the new master state in the
CIB.

This creates a race because those two promote commands are asynchronous
so when we exit the ovsdb_server_promote() function the underlying DBs
are not guaranteed to be in master state. That means that clients might
connect to an instance that is in read-only mode.

We add a simple sleep loop where we wait for the underlying DB state to
confirm the master state. We do not need to add a timeout loop because
in case of an issue the resource timeout set within pacemaker will kick
in and the resource agent script will be killed by pacemaker.

Tested this within an openstack environment using ovn with roughly ~20
reboots and was unable to trigger the issue (before the patch we would
trigger the issue after a couple of reboots tops).

Signed-off-by: Michele Baldessari <michele@acksyn.org>
---
 ovn/utilities/ovndb-servers.ocf | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Daniel Alvarez Sanchez July 9, 2019, 7:27 a.m. UTC | #1
Thanks a lot Michele.
Just mentioning that this has been tested in an OpenStack environment
successfully. A timeout is not needed for the while loop since
pacemaker will enforce its own.

On Tue, Jul 9, 2019 at 9:20 AM Michele Baldessari <michele@acksyn.org> wrote:
>
> Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
> and 'promote_ovnsb' and then just record the new master state in the
> CIB.
>
> This creates a race because those two promote commands are asynchronous
> so when we exit the ovsdb_server_promote() function the underlying DBs
> are not guaranteed to be in master state. That means that clients might
> connect to an instance that is in read-only mode.
>
> We add a simple sleep loop where we wait for the underlying DB state to
> confirm the master state. We do not need to add a timeout loop because
> in case of an issue the resource timeout set within pacemaker will kick
> in and the resource agent script will be killed by pacemaker.
>
> Tested this within an openstack environment using ovn with roughly ~20
> reboots and was unable to trigger the issue (before the patch we would
> trigger the issue after a couple of reboots tops).
>
> Signed-off-by: Michele Baldessari <michele@acksyn.org>
> ---
>  ovn/utilities/ovndb-servers.ocf | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
> index 10313304cb7c..cd47426689ef 100755
> --- a/ovn/utilities/ovndb-servers.ocf
> +++ b/ovn/utilities/ovndb-servers.ocf
> @@ -516,6 +516,8 @@ ovsdb_server_stop() {
>  }
>
>  ovsdb_server_promote() {
> +    local state
> +
>      ovsdb_server_check_status ignore_northd
>      rc=$?
>      case $rc in
> @@ -540,7 +542,15 @@ ovsdb_server_promote() {
>          ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
>      fi
>
> -    ocf_log debug "ovndb_servers: Promoting $host_name as the master"
> +    ocf_log debug "ovndb_servers: Waiting for promotion $host_name as master to complete"
> +    ovsdb_server_check_status
> +    state=$?
> +    while [ "$state" != "$OCF_RUNNING_MASTER" ]; do
> +      sleep 1
> +      ovsdb_server_check_status
> +      state=$?
> +    done
> +    ocf_log debug "ovndb_servers: Promotion of $host_name as the master completed"
>      # Record ourselves so that the agent has a better chance of doing
>      # the right thing at startup
>      ${CRM_ATTR_REPL_INFO} -v "$host_name"
> --
> 2.21.0

Acked-By: Daniel Alvarez <dalvarez@redhat.com>
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Numan Siddique July 9, 2019, 7:37 a.m. UTC | #2
On Tue, Jul 9, 2019 at 1:04 PM Daniel Alvarez Sanchez <dalvarez@redhat.com>
wrote:

> Thanks a lot Michele.
> Just mentioning that this has been tested in an OpenStack environment
> successfully. A timeout is not needed for the while loop since
> pacemaker will enforce its own.
>
> On Tue, Jul 9, 2019 at 9:20 AM Michele Baldessari <michele@acksyn.org>
> wrote:
> >
> > Currently inside the ovsdb_server_promote() function we call
> 'promote_ovnnb'
> > and 'promote_ovnsb' and then just record the new master state in the
> > CIB.
> >
> > This creates a race because those two promote commands are asynchronous
> > so when we exit the ovsdb_server_promote() function the underlying DBs
> > are not guaranteed to be in master state. That means that clients might
> > connect to an instance that is in read-only mode.
> >
> > We add a simple sleep loop where we wait for the underlying DB state to
> > confirm the master state. We do not need to add a timeout loop because
> > in case of an issue the resource timeout set within pacemaker will kick
> > in and the resource agent script will be killed by pacemaker.
> >
> > Tested this within an openstack environment using ovn with roughly ~20
> > reboots and was unable to trigger the issue (before the patch we would
> > trigger the issue after a couple of reboots tops).
> >
> > Signed-off-by: Michele Baldessari <michele@acksyn.org>
>

LGTM

Acked-by: Numan Siddique <nusiddiq@redhat.com>



> > ---
> >  ovn/utilities/ovndb-servers.ocf | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/ovn/utilities/ovndb-servers.ocf
> b/ovn/utilities/ovndb-servers.ocf
> > index 10313304cb7c..cd47426689ef 100755
> > --- a/ovn/utilities/ovndb-servers.ocf
> > +++ b/ovn/utilities/ovndb-servers.ocf
> > @@ -516,6 +516,8 @@ ovsdb_server_stop() {
> >  }
> >
> >  ovsdb_server_promote() {
> > +    local state
> > +
> >      ovsdb_server_check_status ignore_northd
> >      rc=$?
> >      case $rc in
> > @@ -540,7 +542,15 @@ ovsdb_server_promote() {
> >          ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
> >      fi
> >
> > -    ocf_log debug "ovndb_servers: Promoting $host_name as the master"
> > +    ocf_log debug "ovndb_servers: Waiting for promotion $host_name as
> master to complete"
> > +    ovsdb_server_check_status
> > +    state=$?
> > +    while [ "$state" != "$OCF_RUNNING_MASTER" ]; do
> > +      sleep 1
> > +      ovsdb_server_check_status
> > +      state=$?
> > +    done
> > +    ocf_log debug "ovndb_servers: Promotion of $host_name as the master
> completed"
> >      # Record ourselves so that the agent has a better chance of doing
> >      # the right thing at startup
> >      ${CRM_ATTR_REPL_INFO} -v "$host_name"
> > --
> > 2.21.0
>
> Acked-By: Daniel Alvarez <dalvarez@redhat.com>
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Lucas Alvares Gomes July 9, 2019, 7:39 a.m. UTC | #3
Thanks for the patch Michele.

As @Daniel pointed out, this have been tested and works.

On Tue, Jul 9, 2019 at 8:18 AM Michele Baldessari <michele@acksyn.org> wrote:
>
> Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
> and 'promote_ovnsb' and then just record the new master state in the
> CIB.
>
> This creates a race because those two promote commands are asynchronous
> so when we exit the ovsdb_server_promote() function the underlying DBs
> are not guaranteed to be in master state. That means that clients might
> connect to an instance that is in read-only mode.
>
> We add a simple sleep loop where we wait for the underlying DB state to
> confirm the master state. We do not need to add a timeout loop because
> in case of an issue the resource timeout set within pacemaker will kick
> in and the resource agent script will be killed by pacemaker.
>
> Tested this within an openstack environment using ovn with roughly ~20
> reboots and was unable to trigger the issue (before the patch we would
> trigger the issue after a couple of reboots tops).
>
> Signed-off-by: Michele Baldessari <michele@acksyn.org>
> ---
>  ovn/utilities/ovndb-servers.ocf | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
> index 10313304cb7c..cd47426689ef 100755
> --- a/ovn/utilities/ovndb-servers.ocf
> +++ b/ovn/utilities/ovndb-servers.ocf
> @@ -516,6 +516,8 @@ ovsdb_server_stop() {
>  }
>
>  ovsdb_server_promote() {
> +    local state
> +
>      ovsdb_server_check_status ignore_northd
>      rc=$?
>      case $rc in
> @@ -540,7 +542,15 @@ ovsdb_server_promote() {
>          ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
>      fi
>
> -    ocf_log debug "ovndb_servers: Promoting $host_name as the master"
> +    ocf_log debug "ovndb_servers: Waiting for promotion $host_name as master to complete"
> +    ovsdb_server_check_status
> +    state=$?
> +    while [ "$state" != "$OCF_RUNNING_MASTER" ]; do
> +      sleep 1
> +      ovsdb_server_check_status
> +      state=$?
> +    done
> +    ocf_log debug "ovndb_servers: Promotion of $host_name as the master completed"
>      # Record ourselves so that the agent has a better chance of doing
>      # the right thing at startup
>      ${CRM_ATTR_REPL_INFO} -v "$host_name"
> --
> 2.21.0
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Acked-By: Lucas Alvares Gomes <lucasagomes@gmail.com>
0-day Robot July 9, 2019, 7:59 a.m. UTC | #4
Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 89 characters long (recommended limit is 79)
#50 FILE: ovn/utilities/ovndb-servers.ocf:545:
    ocf_log debug "ovndb_servers: Waiting for promotion $host_name as master to complete"

WARNING: Line is 82 characters long (recommended limit is 79)
#58 FILE: ovn/utilities/ovndb-servers.ocf:553:
    ocf_log debug "ovndb_servers: Promotion of $host_name as the master completed"

Lines checked: 64, Warnings: 2, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@bytheb.org

Thanks,
0-day Robot
Terry Wilson July 17, 2019, 11:23 a.m. UTC | #5
Is this just waiting on a couple of line length issues to be fixed? Or do
those really matter?

On Tue, Jul 9, 2019 at 3:10 AM 0-day Robot <robot@bytheb.org> wrote:

> Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have tried
> out your patch.
> Thanks for your contribution.
>
> I encountered some error that I wasn't expecting.  See the details below.
>
>
> checkpatch:
> WARNING: Line is 89 characters long (recommended limit is 79)
> #50 FILE: ovn/utilities/ovndb-servers.ocf:545:
>     ocf_log debug "ovndb_servers: Waiting for promotion $host_name as
> master to complete"
>
> WARNING: Line is 82 characters long (recommended limit is 79)
> #58 FILE: ovn/utilities/ovndb-servers.ocf:553:
>     ocf_log debug "ovndb_servers: Promotion of $host_name as the master
> completed"
>
> Lines checked: 64, Warnings: 2, Errors: 0
>
>
> Please check this out.  If you feel there has been an error, please email
> aconole@bytheb.org
>
> Thanks,
> 0-day Robot
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Michele Baldessari July 17, 2019, 12:43 p.m. UTC | #6
On Wed, Jul 17, 2019 at 06:23:56AM -0500, Terry Wilson wrote:
> Is this just waiting on a couple of line length issues to be fixed? Or do
> those really matter?

Hi Terry,

I kind of ignored it because the ovn/utilities/ovndb-servers.ocf file
has already a bunch of lines > 79 chars. I can still fix it up if that
is preferred?

cheers,
Michele

> 
> On Tue, Jul 9, 2019 at 3:10 AM 0-day Robot <robot@bytheb.org> wrote:
> 
> > Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have tried
> > out your patch.
> > Thanks for your contribution.
> >
> > I encountered some error that I wasn't expecting.  See the details below.
> >
> >
> > checkpatch:
> > WARNING: Line is 89 characters long (recommended limit is 79)
> > #50 FILE: ovn/utilities/ovndb-servers.ocf:545:
> >     ocf_log debug "ovndb_servers: Waiting for promotion $host_name as
> > master to complete"
> >
> > WARNING: Line is 82 characters long (recommended limit is 79)
> > #58 FILE: ovn/utilities/ovndb-servers.ocf:553:
> >     ocf_log debug "ovndb_servers: Promotion of $host_name as the master
> > completed"
> >
> > Lines checked: 64, Warnings: 2, Errors: 0
> >
> >
> > Please check this out.  If you feel there has been an error, please email
> > aconole@bytheb.org
> >
> > Thanks,
> > 0-day Robot
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
Aaron Conole July 17, 2019, 2:11 p.m. UTC | #7
Michele Baldessari <michele@acksyn.org> writes:

> On Wed, Jul 17, 2019 at 06:23:56AM -0500, Terry Wilson wrote:
>> Is this just waiting on a couple of line length issues to be fixed? Or do
>> those really matter?
>
> Hi Terry,
>
> I kind of ignored it because the ovn/utilities/ovndb-servers.ocf file
> has already a bunch of lines > 79 chars. I can still fix it up if that
> is preferred?

Those shouldn't be 'blocking' anything.  It's to bring attention
(warning vs. error).  Maybe it hasn't been looked at by Numan / aginwala
yet.  CC'd for more attention.

> cheers,
> Michele
>
>> 
>> On Tue, Jul 9, 2019 at 3:10 AM 0-day Robot <robot@bytheb.org> wrote:
>> 
>> > Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have tried
>> > out your patch.
>> > Thanks for your contribution.
>> >
>> > I encountered some error that I wasn't expecting.  See the details below.
>> >
>> >
>> > checkpatch:
>> > WARNING: Line is 89 characters long (recommended limit is 79)
>> > #50 FILE: ovn/utilities/ovndb-servers.ocf:545:
>> >     ocf_log debug "ovndb_servers: Waiting for promotion $host_name as
>> > master to complete"
>> >
>> > WARNING: Line is 82 characters long (recommended limit is 79)
>> > #58 FILE: ovn/utilities/ovndb-servers.ocf:553:
>> >     ocf_log debug "ovndb_servers: Promotion of $host_name as the master
>> > completed"
>> >
>> > Lines checked: 64, Warnings: 2, Errors: 0
>> >
>> >
>> > Please check this out.  If you feel there has been an error, please email
>> > aconole@bytheb.org
>> >
>> > Thanks,
>> > 0-day Robot
>> > _______________________________________________
>> > dev mailing list
>> > dev@openvswitch.org
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> >
Numan Siddique July 17, 2019, 2:46 p.m. UTC | #8
On Wed, Jul 17, 2019 at 7:41 PM Aaron Conole <aconole@redhat.com> wrote:

> Michele Baldessari <michele@acksyn.org> writes:
>
> > On Wed, Jul 17, 2019 at 06:23:56AM -0500, Terry Wilson wrote:
> >> Is this just waiting on a couple of line length issues to be fixed? Or
> do
> >> those really matter?
> >
> > Hi Terry,
> >
> > I kind of ignored it because the ovn/utilities/ovndb-servers.ocf file
> > has already a bunch of lines > 79 chars. I can still fix it up if that
> > is preferred?
>
> Those shouldn't be 'blocking' anything.  It's to bring attention
> (warning vs. error).  Maybe it hasn't been looked at by Numan / aginwala
> yet.  CC'd for more attention.
>

I looked into this patch and acked it :).
I think we can ignore the warning for the shell files.

Thanks
Numan


>
> > cheers,
> > Michele
> >
> >>
> >> On Tue, Jul 9, 2019 at 3:10 AM 0-day Robot <robot@bytheb.org> wrote:
> >>
> >> > Bleep bloop.  Greetings Michele Baldessari, I am a robot and I have
> tried
> >> > out your patch.
> >> > Thanks for your contribution.
> >> >
> >> > I encountered some error that I wasn't expecting.  See the details
> below.
> >> >
> >> >
> >> > checkpatch:
> >> > WARNING: Line is 89 characters long (recommended limit is 79)
> >> > #50 FILE: ovn/utilities/ovndb-servers.ocf:545:
> >> >     ocf_log debug "ovndb_servers: Waiting for promotion $host_name as
> >> > master to complete"
> >> >
> >> > WARNING: Line is 82 characters long (recommended limit is 79)
> >> > #58 FILE: ovn/utilities/ovndb-servers.ocf:553:
> >> >     ocf_log debug "ovndb_servers: Promotion of $host_name as the
> master
> >> > completed"
> >> >
> >> > Lines checked: 64, Warnings: 2, Errors: 0
> >> >
> >> >
> >> > Please check this out.  If you feel there has been an error, please
> email
> >> > aconole@bytheb.org
> >> >
> >> > Thanks,
> >> > 0-day Robot
> >> > _______________________________________________
> >> > dev mailing list
> >> > dev@openvswitch.org
> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >> >
>
Ben Pfaff July 17, 2019, 6:11 p.m. UTC | #9
On Tue, Jul 09, 2019 at 09:02:57AM +0200, Michele Baldessari wrote:
> Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
> and 'promote_ovnsb' and then just record the new master state in the
> CIB.
> 
> This creates a race because those two promote commands are asynchronous
> so when we exit the ovsdb_server_promote() function the underlying DBs
> are not guaranteed to be in master state. That means that clients might
> connect to an instance that is in read-only mode.
> 
> We add a simple sleep loop where we wait for the underlying DB state to
> confirm the master state. We do not need to add a timeout loop because
> in case of an issue the resource timeout set within pacemaker will kick
> in and the resource agent script will be killed by pacemaker.

Thanks, applied to master.

Patch
diff mbox series

diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
index 10313304cb7c..cd47426689ef 100755
--- a/ovn/utilities/ovndb-servers.ocf
+++ b/ovn/utilities/ovndb-servers.ocf
@@ -516,6 +516,8 @@  ovsdb_server_stop() {
 }
 
 ovsdb_server_promote() {
+    local state
+
     ovsdb_server_check_status ignore_northd
     rc=$?
     case $rc in
@@ -540,7 +542,15 @@  ovsdb_server_promote() {
         ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
     fi
 
-    ocf_log debug "ovndb_servers: Promoting $host_name as the master"
+    ocf_log debug "ovndb_servers: Waiting for promotion $host_name as master to complete"
+    ovsdb_server_check_status
+    state=$?
+    while [ "$state" != "$OCF_RUNNING_MASTER" ]; do
+      sleep 1
+      ovsdb_server_check_status
+      state=$?
+    done
+    ocf_log debug "ovndb_servers: Promotion of $host_name as the master completed"
     # Record ourselves so that the agent has a better chance of doing
     # the right thing at startup
     ${CRM_ATTR_REPL_INFO} -v "$host_name"