diff mbox series

[ovs-dev] ofctrl: Wait at S_WAIT_BEFORE_CLEAR only once.

Message ID 20240305062744.2105705-1-hzhou@ovn.org
State Changes Requested
Headers show
Series [ovs-dev] ofctrl: Wait at S_WAIT_BEFORE_CLEAR only once. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Han Zhou March 5, 2024, 6:27 a.m. UTC
The ovn-ofctrl-wait-before-clear setting is designed to minimize
downtime during the initial start-up of the ovn-controller. For this
purpose, the ovn-controller should wait only once upon entering the
S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections
to the OVS, such as those occurring during an OVS restart/upgrade,
should not trigger this wait. However, the current implemention always
waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which
can inadvertently delay flow installations during OVS restart/upgrade,
potentially causing more harm than good. (The extent of the impact
varies based on the method used to restart OVS, including whether flow
save/restore tools and the flow-restore-wait feature are employed.)

This patch avoids the unnecessary wait after the initial one.

Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.")
Signed-off-by: Han Zhou <hzhou@ovn.org>
---
 controller/ofctrl.c     | 1 -
 tests/ovn-controller.at | 9 +++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

Comments

Mark Michelson March 18, 2024, 6:27 p.m. UTC | #1
Hi Han,

I have a comment below

On 3/5/24 01:27, Han Zhou wrote:
> The ovn-ofctrl-wait-before-clear setting is designed to minimize
> downtime during the initial start-up of the ovn-controller. For this
> purpose, the ovn-controller should wait only once upon entering the
> S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections
> to the OVS, such as those occurring during an OVS restart/upgrade,
> should not trigger this wait. However, the current implemention always
> waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which
> can inadvertently delay flow installations during OVS restart/upgrade,
> potentially causing more harm than good. (The extent of the impact
> varies based on the method used to restart OVS, including whether flow
> save/restore tools and the flow-restore-wait feature are employed.)
> 
> This patch avoids the unnecessary wait after the initial one.
> 
> Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to reduce down time during upgrade.")
> Signed-off-by: Han Zhou <hzhou@ovn.org>
> ---
>   controller/ofctrl.c     | 1 -
>   tests/ovn-controller.at | 9 +++++++--
>   2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> index f14cd79a8dbb..0d72ecbaa167 100644
> --- a/controller/ofctrl.c
> +++ b/controller/ofctrl.c
> @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void)
>       if (!wait_before_clear_time ||
>           (wait_before_clear_expire &&
>            time_msec() >= wait_before_clear_expire)) {
> -        wait_before_clear_expire = 0;
>           state = S_CLEAR_FLOWS;
>           return;
>       }
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index 37f1ded1bd26..b65e11722cbb 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run)
>   AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2
>   ])
>   
> -# Restart OVS this time, and wait until flows are reinstalled
> +# Restart OVS this time. Flows should be reinstalled without waiting.
>   OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>   start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl
> -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2])
> +
> +# Sleep for 3s, which is long enough for the flows to be installed, but
> +# shorter than the wait-before-clear (5s), to make sure the flows are installed
> +# without waiting.
> +sleep 3

This change makes me nervous. The comment makes sense. However, I worry 
that on slow or loaded systems, relying on the flows to be written 
within 3 seconds may not always work out.

If there were a way to peek into the ofctrl state machine and check that 
we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work 
better. But that is something that is hard to justify exposing.

I came up with this possible idea:
  * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60 
seconds)
  * Restart ovs
  * Use OVS_WAIT_UNTIL(...), just like the test used to do.

This way, we get plenty of opportunities to ensure the flows were 
written. In most cases, this probably will actually be quicker than the 
3 second sleep added in this patch. However, if it takes longer than 3 
seconds, then the test can still pass. If the flows get written 
properly, then we know ovn-controller did not wait for the 
wait-before-clear time.

> +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore])
>   
>   check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
>   -- ls-lb-add ls1 lb3
Han Zhou March 20, 2024, 11:07 p.m. UTC | #2
On Mon, Mar 18, 2024 at 11:27 AM Mark Michelson <mmichels@redhat.com> wrote:
>
> Hi Han,
>
> I have a comment below
>
> On 3/5/24 01:27, Han Zhou wrote:
> > The ovn-ofctrl-wait-before-clear setting is designed to minimize
> > downtime during the initial start-up of the ovn-controller. For this
> > purpose, the ovn-controller should wait only once upon entering the
> > S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections
> > to the OVS, such as those occurring during an OVS restart/upgrade,
> > should not trigger this wait. However, the current implemention always
> > waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which
> > can inadvertently delay flow installations during OVS restart/upgrade,
> > potentially causing more harm than good. (The extent of the impact
> > varies based on the method used to restart OVS, including whether flow
> > save/restore tools and the flow-restore-wait feature are employed.)
> >
> > This patch avoids the unnecessary wait after the initial one.
> >
> > Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to
reduce down time during upgrade.")
> > Signed-off-by: Han Zhou <hzhou@ovn.org>
> > ---
> >   controller/ofctrl.c     | 1 -
> >   tests/ovn-controller.at | 9 +++++++--
> >   2 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> > index f14cd79a8dbb..0d72ecbaa167 100644
> > --- a/controller/ofctrl.c
> > +++ b/controller/ofctrl.c
> > @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void)
> >       if (!wait_before_clear_time ||
> >           (wait_before_clear_expire &&
> >            time_msec() >= wait_before_clear_expire)) {
> > -        wait_before_clear_expire = 0;
> >           state = S_CLEAR_FLOWS;
> >           return;
> >       }
> > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> > index 37f1ded1bd26..b65e11722cbb 100644
> > --- a/tests/ovn-controller.at
> > +++ b/tests/ovn-controller.at
> > @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller
coverage/read-counter lflow_run)
> >   AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2
> >   ])
> >
> > -# Restart OVS this time, and wait until flows are reinstalled
> > +# Restart OVS this time. Flows should be reinstalled without waiting.
> >   OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> >   start_daemon ovs-vswitchd --enable-dummy=system -vvconn
-vofproto_dpif -vunixctl
> > -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep
-vF 2.2.2.2])
> > +
> > +# Sleep for 3s, which is long enough for the flows to be installed, but
> > +# shorter than the wait-before-clear (5s), to make sure the flows are
installed
> > +# without waiting.
> > +sleep 3
>
> This change makes me nervous. The comment makes sense. However, I worry
> that on slow or loaded systems, relying on the flows to be written
> within 3 seconds may not always work out.
>
> If there were a way to peek into the ofctrl state machine and check that
> we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work
> better. But that is something that is hard to justify exposing.
>
> I came up with this possible idea:
>   * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60
> seconds)
>   * Restart ovs
>   * Use OVS_WAIT_UNTIL(...), just like the test used to do.
>
> This way, we get plenty of opportunities to ensure the flows were
> written. In most cases, this probably will actually be quicker than the
> 3 second sleep added in this patch. However, if it takes longer than 3
> seconds, then the test can still pass. If the flows get written
> properly, then we know ovn-controller did not wait for the
> wait-before-clear time.
>

Hi Mark, thanks for your comment! I agree with you that sleep for 3s is not
very reliable. Your suggestion looks better, but I think there is still a
potential problem. The approach assumes that ovn-controller will always
apply the new settings of ofctrl-wait-before-clear. It is true for the
current implementation, but there is nothing preventing us from removing
this logic, so that ovn-controller ignores any ofctrl-wait-before-clear
setting changes after startup. In fact, it is more reasonable to ignore it.
So, let's assume ovn-controller doesn't take care of the changes of the
settings. In this test case, the setting is initially 5s when
ovn-controller starts, and later changing it to 60s doesn't take effect and
ovn-controller still uses the 5s value. And then let's assume
ovn-controller still waits for the 5s after OVS is restarted, and the test
case will pass because OVS_WAIT_UNTIL can wait for much longer than 5s. So
the test will be incorrect.

To avoid this potential problem, I tried another approach. I figured that
simply adding a "ovn-nbctl --wait=hv sync" is sufficient to ensure
ovn-controller had the chance to install flows, without the need to sleep.
So please see if the below diff looks good:

---------------------------------------------------------------------------------------------------------------------------
diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
index 0da6570c5882..3202f0beff46 100644
--- a/tests/ovn-controller.at
+++ b/tests/ovn-controller.at
@@ -2328,10 +2328,10 @@ AT_CHECK_UNQUOTED([echo $lflow_run_1], [0],
[$lflow_run_2
 OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
 start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif
-vunixctl

-# Sleep for 3s, which is long enough for the flows to be installed, but
-# shorter than the wait-before-clear (5s), to make sure the flows are
installed
-# without waiting.
-sleep 3
+# Sync to make sure ovn-controller is given enough time to install the
flows.
+check ovn-nbctl --wait=hv sync
+
+# Flow should be installed without any extra waiting.
 AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF
2.2.2.2], [0], [ignore])

 check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
---------------------------------------------------------------------------------------------------------------------------

Thanks,
Han

> > +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF
2.2.2.2], [0], [ignore])
> >
> >   check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
> >   -- ls-lb-add ls1 lb3
>
Han Zhou March 28, 2024, 7 a.m. UTC | #3
On Wed, Mar 20, 2024 at 4:07 PM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Mon, Mar 18, 2024 at 11:27 AM Mark Michelson <mmichels@redhat.com>
wrote:
> >
> > Hi Han,
> >
> > I have a comment below
> >
> > On 3/5/24 01:27, Han Zhou wrote:
> > > The ovn-ofctrl-wait-before-clear setting is designed to minimize
> > > downtime during the initial start-up of the ovn-controller. For this
> > > purpose, the ovn-controller should wait only once upon entering the
> > > S_WAIT_BEFORE_CLEAR state for the first time. Subsequent reconnections
> > > to the OVS, such as those occurring during an OVS restart/upgrade,
> > > should not trigger this wait. However, the current implemention always
> > > waits for the configured time in the S_WAIT_BEFORE_CLEAR state, which
> > > can inadvertently delay flow installations during OVS restart/upgrade,
> > > potentially causing more harm than good. (The extent of the impact
> > > varies based on the method used to restart OVS, including whether flow
> > > save/restore tools and the flow-restore-wait feature are employed.)
> > >
> > > This patch avoids the unnecessary wait after the initial one.
> > >
> > > Fixes: 896adfd2d8b3 ("ofctrl: Support ovn-ofctrl-wait-before-clear to
reduce down time during upgrade.")
> > > Signed-off-by: Han Zhou <hzhou@ovn.org>
> > > ---
> > >   controller/ofctrl.c     | 1 -
> > >   tests/ovn-controller.at | 9 +++++++--
> > >   2 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> > > index f14cd79a8dbb..0d72ecbaa167 100644
> > > --- a/controller/ofctrl.c
> > > +++ b/controller/ofctrl.c
> > > @@ -634,7 +634,6 @@ run_S_WAIT_BEFORE_CLEAR(void)
> > >       if (!wait_before_clear_time ||
> > >           (wait_before_clear_expire &&
> > >            time_msec() >= wait_before_clear_expire)) {
> > > -        wait_before_clear_expire = 0;
> > >           state = S_CLEAR_FLOWS;
> > >           return;
> > >       }
> > > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> > > index 37f1ded1bd26..b65e11722cbb 100644
> > > --- a/tests/ovn-controller.at
> > > +++ b/tests/ovn-controller.at
> > > @@ -2284,10 +2284,15 @@ lflow_run_2=$(ovn-appctl -t ovn-controller
coverage/read-counter lflow_run)
> > >   AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2
> > >   ])
> > >
> > > -# Restart OVS this time, and wait until flows are reinstalled
> > > +# Restart OVS this time. Flows should be reinstalled without waiting.
> > >   OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
> > >   start_daemon ovs-vswitchd --enable-dummy=system -vvconn
-vofproto_dpif -vunixctl
> > > -OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 |
grep -vF 2.2.2.2])
> > > +
> > > +# Sleep for 3s, which is long enough for the flows to be installed,
but
> > > +# shorter than the wait-before-clear (5s), to make sure the flows
are installed
> > > +# without waiting.
> > > +sleep 3
> >
> > This change makes me nervous. The comment makes sense. However, I worry
> > that on slow or loaded systems, relying on the flows to be written
> > within 3 seconds may not always work out.
> >
> > If there were a way to peek into the ofctrl state machine and check that
> > we have moved off of S_WAIT_BEFORE_CLEAR by this point, that might work
> > better. But that is something that is hard to justify exposing.
> >
> > I came up with this possible idea:
> >   * set wait-before-clear to a time longer than OVS_CTL_TIMEOUT (e.g. 60
> > seconds)
> >   * Restart ovs
> >   * Use OVS_WAIT_UNTIL(...), just like the test used to do.
> >
> > This way, we get plenty of opportunities to ensure the flows were
> > written. In most cases, this probably will actually be quicker than the
> > 3 second sleep added in this patch. However, if it takes longer than 3
> > seconds, then the test can still pass. If the flows get written
> > properly, then we know ovn-controller did not wait for the
> > wait-before-clear time.
> >
>
> Hi Mark, thanks for your comment! I agree with you that sleep for 3s is
not very reliable. Your suggestion looks better, but I think there is still
a potential problem. The approach assumes that ovn-controller will always
apply the new settings of ofctrl-wait-before-clear. It is true for the
current implementation, but there is nothing preventing us from removing
this logic, so that ovn-controller ignores any ofctrl-wait-before-clear
setting changes after startup. In fact, it is more reasonable to ignore it.
So, let's assume ovn-controller doesn't take care of the changes of the
settings. In this test case, the setting is initially 5s when
ovn-controller starts, and later changing it to 60s doesn't take effect and
ovn-controller still uses the 5s value. And then let's assume
ovn-controller still waits for the 5s after OVS is restarted, and the test
case will pass because OVS_WAIT_UNTIL can wait for much longer than 5s. So
the test will be incorrect.
>
> To avoid this potential problem, I tried another approach. I figured that
simply adding a "ovn-nbctl --wait=hv sync" is sufficient to ensure
ovn-controller had the chance to install flows, without the need to sleep.
So please see if the below diff looks good:
>
>
---------------------------------------------------------------------------------------------------------------------------
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index 0da6570c5882..3202f0beff46 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -2328,10 +2328,10 @@ AT_CHECK_UNQUOTED([echo $lflow_run_1], [0],
[$lflow_run_2
>  OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
>  start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif
-vunixctl
>
> -# Sleep for 3s, which is long enough for the flows to be installed, but
> -# shorter than the wait-before-clear (5s), to make sure the flows are
installed
> -# without waiting.
> -sleep 3
> +# Sync to make sure ovn-controller is given enough time to install the
flows.
> +check ovn-nbctl --wait=hv sync
> +
> +# Flow should be installed without any extra waiting.
>  AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF
2.2.2.2], [0], [ignore])
>
>  check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
>
---------------------------------------------------------------------------------------------------------------------------
>
> Thanks,
> Han
>
I sent the above change as v2:

https://patchwork.ozlabs.org/project/ovn/patch/20240328065835.789596-1-hzhou@ovn.org/

Thanks,
Han

> > > +AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF
2.2.2.2], [0], [ignore])
> > >
> > >   check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
> > >   -- ls-lb-add ls1 lb3
> >
diff mbox series

Patch

diff --git a/controller/ofctrl.c b/controller/ofctrl.c
index f14cd79a8dbb..0d72ecbaa167 100644
--- a/controller/ofctrl.c
+++ b/controller/ofctrl.c
@@ -634,7 +634,6 @@  run_S_WAIT_BEFORE_CLEAR(void)
     if (!wait_before_clear_time ||
         (wait_before_clear_expire &&
          time_msec() >= wait_before_clear_expire)) {
-        wait_before_clear_expire = 0;
         state = S_CLEAR_FLOWS;
         return;
     }
diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
index 37f1ded1bd26..b65e11722cbb 100644
--- a/tests/ovn-controller.at
+++ b/tests/ovn-controller.at
@@ -2284,10 +2284,15 @@  lflow_run_2=$(ovn-appctl -t ovn-controller coverage/read-counter lflow_run)
 AT_CHECK_UNQUOTED([echo $lflow_run_1], [0], [$lflow_run_2
 ])
 
-# Restart OVS this time, and wait until flows are reinstalled
+# Restart OVS this time. Flows should be reinstalled without waiting.
 OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
 start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl
-OVS_WAIT_UNTIL([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2])
+
+# Sleep for 3s, which is long enough for the flows to be installed, but
+# shorter than the wait-before-clear (5s), to make sure the flows are installed
+# without waiting.
+sleep 3
+AT_CHECK([ovs-ofctl dump-flows br-int | grep -F 10.1.2.4 | grep -vF 2.2.2.2], [0], [ignore])
 
 check ovn-nbctl --wait=hv lb-add lb3 3.3.3.3 10.1.2.5 \
 -- ls-lb-add ls1 lb3